Capturing epigenomes at high-resolution for insight into genome function and metabolic disease risk

Fiona Allum

Department of Human Genetics, Faculty of Medicine
McGill University, Montréal
April 2019

A thesis submitted to McGill University in partial fulfillment of the requirements of the degree of Doctor of Philosophy

© Fiona Allum 2019

“It does not do to dwell on dreams and forget to live” - Albus Dumbledore (J.K. Rowling)

Table of Contents

ABSTRACT
RÉSUMÉ
LIST OF ABBREVIATIONS
LIST OF TABLES
LIST OF FIGURES
ACKNOWLEDGMENTS
PREFACE
  Format of the Thesis
  Contribution of Authors
  Original Contribution to Knowledge

CHAPTER 1: INTRODUCTION
  1.1 Studying Complex Traits: Focus on Metabolic Diseases
    1.1.1 Pathophysiology and Socioeconomic Burden
    1.1.2 Environmental and Genetic Factors Contributing to Disease
  1.2 Genome-wide Association Studies
    1.2.1 GWAS of BMI, fat distribution and lipids
    1.2.2 Limitations of GWAS
    1.2.3 Missing Heritability in Common Diseases
    1.2.4 Integrational Studies to Interpret Genetic Variants
    1.2.5 Expression Quantitative Trait Loci Studies
  1.3 Epigenetics to Link Environment and Genetics to Disease Phenotypes
    1.3.1 What is Epigenetics?
    1.3.2 Reference Epigenome Efforts
    1.3.3 Defining Regulatory Elements
    1.3.4 Coordinated Patterns of Epigenetic Regulation
  1.4 DNA methylation
    1.4.1 Roles of DNA Methylation across the Genome
    1.4.2 Methods to Study DNA Methylation
    1.4.3 Epigenome-wide Association Studies of Common Traits
  1.5 Rationale, Hypothesis and Objectives

CHAPTER 2: CHARACTERIZATION OF METHYLOMES BY NEXT-GENERATION CAPTURE SEQUENCING
  2.1 Bridging Statement between Chapter 1 and 2
  2.2 Title, Authors and Affiliations
  2.3 Abstract
  2.4 Introduction
  2.5 Results
    2.5.1 First-generation Capture Panel Design for MCC-Seq
    2.5.2 Second-generation Panel Design for Comprehensive Profiling
    2.5.3 Sample-based Validation of MCC-Seq
    2.5.4 Population-based Validation of MCC-Seq
    2.5.5 Population-based Genotype Profiling by MCC-Seq
    2.5.6 EWAS of TG Levels using MCC-Seq
    2.5.7 Assessment of Loci Harbouring TG-associated CpGs
    2.5.8 Follow-up of the TG-associated Loci Mapping to CD36
  2.6 Discussion
  2.7 Online Methods
    2.7.1 First-generation Panel Design
    2.7.2 Generation of Second-generation Panel
    2.7.3 MCC-Seq Protocol
    2.7.4 MCC-Seq Methylation Profiling
    2.7.5 Illumina 450K Array Methylation Profiling
    2.7.6 Agilent SureSelect CpG Profiling and MCC-Seq Comparisons
    2.7.7 Trait-association Discovery Cohort
    2.7.8 DNA Isolation
    2.7.9 Identification of Hypomethylated Regions
    2.7.10 Genotyping
    2.7.11 Epigenome-wide Association of TG Levels
    2.7.12 Adipocyte Nuclei Isolation
    2.7.13 Transposase-accessible chromatin Sequencing
    2.7.14 Blood Cell Isolation
    2.7.15 RNA Sequencing
  2.8 Acknowledgements
  2.9 Additional Information
    Accession codes
    Competing financial interests
    The Multiple Tissue Human Expression Resource Consortium
  2.10 Main Tables and Figures
    2.10.1 Tables
    2.10.2 Figures
  2.11 Supplementary Materials
    2.11.1 Supplementary Tables
    2.11.2 Supplementary Figures
    2.11.3 Supplementary Data

CHAPTER 3: DISSECTING FEATURES OF EPIGENETIC VARIANTS UNDERLYING CARDIOMETABOLIC RISK
  3.1 Bridging Statement between Chapter 2 and 3
  3.2 Title, Authors and Affiliations
  3.3 Abstract
  3.4 Introduction
  3.5 Results
    3.5.1 Adipose tissue epigenetic variants linked to plasma lipids
    3.5.2 Positioning of lipid-CpGs within regulatory elements
    3.5.3 Replication of lipid-linked adipose regulatory regions
    3.5.4 Functional annotation of lipid-CpGs
    3.5.5 Tissue-specificity of lipid-linked regulatory regions
    3.5.6 Genetic contribution to lipid-CpG methylation variability
    3.5.7 Regulation of lipid-linked adipose-specific enhancers
  3.6 Discussion
  3.7 Methods
    3.7.1 Sample collections
    3.7.2 MCC-Seq methylation profiling
    3.7.3 Epigenome-wide association of plasma lipid levels
    3.7.4 Positional mapping analyses
    3.7.5 Transcription factor binding site motif analysis
    3.7.6 Differential expression analyses
    3.7.7 Linking expression to methylation in MuTHER cohort
    3.7.8 Association of gene expression to lipids in MuTHER cohort
    3.7.9 Gene enrichment pathway analyses
    3.7.10 Conditional modelling of HDL-EWAS on SNPs
    3.7.11 Code availability
  3.8 Data Availability
  3.9 Acknowledgements
  3.10 Author Information
    Competing interests
  3.11 Main Tables and Figures
    3.11.1 Tables
    3.11.2 Figures
  3.12 Supplementary Materials
    3.12.1 Supplementary Tables
    3.12.2 Supplementary Figures
    3.12.3 Supplementary Notes
    3.12.4 Supplementary Data

CHAPTER 4: GENERAL DISCUSSION

CHAPTER 5: CONCLUSIONS AND FUTURE DIRECTIONS

REFERENCES

APPENDICES
  Appendix A: Significant contributions to other publications
  Appendix B: Copyright Permissions
  Appendix C: Ethics and Related Certificates

Abstract

Complex diseases such as obesity and its related co-morbidities are caused by a combination of predisposing genetic and environmental factors. Genome-wide association studies (GWAS) have revealed complex trait-linked genetic variants that have small effect sizes and are enriched in non-coding regions, making translation into biological knowledge challenging. Epigenomic traits, such as DNA (CpG) methylation, can be used as a proxy to link genome and environment to phenotype and disease. Past epigenome-wide investigations have mainly profiled bioavailable tissues (i.e. whole blood) using array-based methods that are biased towards CpG-dense regions such as promoters. However, complex trait-associated and tissue-specific CpGs have been found to be enriched in enhancers active in disease-relevant tissues. While whole-genome bisulfite sequencing enables full investigation of methylomes, its current high cost makes it impractical for large-scale epigenome-wide association studies (EWAS).

We implemented a novel next-generation capture sequencing method, MethylC-Capture Sequencing (MCC-Seq), that permits simultaneous and cost-efficient profiling of methylation and genotypes over user-defined genomic targets (up to 200 Mb). We validated the accuracy of the method through technical comparisons with other available techniques. We developed two custom panels targeting up to ~4.5 million CpGs within the adipose-specific functional methylome and ~0.3 million SNPs from the Illumina HumanCore array. Including these SNPs allowed for downstream genotype imputation, making MCC-Seq a dual-purpose next-generation sequencing method. Using deeply phenotyped population-based and clinical cohorts of visceral adipose tissue and whole blood, we applied MCC-Seq in over 500 samples and revealed previously unreported methylation variants linked to plasma lipid traits in multiple EWAS.

We identified 567 lipid-linked regulatory regions and showed that these are enriched in adipose-specific enhancer regions, highlighting the need for an expanded catalog of interrogated distal regulatory regions in complex disease studies. We used the single-base profiling capacity of MCC-Seq to present novel localization patterns of metabolic trait-linked CpGs at regulatory regions and to fine-map epigenetic signals from low-density coverage studies of almost 700 independent adipose samples. Through our efforts, we generated unique reference maps of the methylome, transcriptome, and chromatin accessibility in the obesity-targeted visceral adipose tissue, including isolated adipocytes. We applied integrational approaches using these maps and other ‘omics’ layers to show features of the identified trait-linked regions, highlighting cis-acting regulatory elements that show putative pleiotropic effects on gene expression levels. We showed that a large proportion (>55%) of lipid-associated adipose regulatory regions are under genetic regulation, with this fraction rising to >93% at elements replicating in matched bioavailable tissue (i.e. whole blood samples). We further highlighted an enrichment for genetic regulation by GWAS SNPs of metabolic disease-linked traits, including at GALNT2. In all, the findings presented in this thesis show the advantage of using high-resolution epigenetic profiling in regulatory elements active in disease-relevant tissues to uncover novel genetic and epigenetic variants in large-scale studies of complex traits.

Résumé

Complex diseases such as obesity and its associated comorbidities are caused by the combined action of genetic and environmental predisposing factors. Genome-wide association studies (GWAS) have identified common genetic factors associated with these diseases, but these have limited predictive value. Moreover, because they lie in the non-coding portion of the genome, they are difficult to link to disease mechanisms. Epigenetic changes, such as DNA (CpG) methylation, can be used to link genetics and environment to complex phenotypes and diseases. Epigenome-wide association studies (EWAS) published to date have mostly profiled bioavailable tissues (e.g. whole blood) and used methods that preferentially target promoter regions of the genome. However, previous studies have shown that complex trait-associated and tissue-specific CpGs are enriched in enhancer elements of disease-relevant tissues. In addition, the high cost of whole-genome bisulfite sequencing (WGBS) makes this method difficult to apply in large-scale EWAS. We implemented a cost-effective next-generation capture sequencing approach, MethylC-Capture Sequencing (MCC-Seq), which simultaneously profiles the methylome and the genome over user-defined targets (up to 200 Mb). We validated the accuracy of the method through comparisons with methylation data collected using established techniques. We developed two capture panels targeting up to 4.5M CpGs overlapping the functional methylome of adipose tissue and 0.3M genetic variants included on the Illumina HumanCore array, thereby allowing imputation of additional genetic data. This demonstrates that MCC-Seq is a dual-purpose sequencing method. Using well-characterized cohorts of visceral adipose tissue (VAT) and whole blood, we applied MCC-Seq in several EWAS, enabling the identification of new epigenomic variants correlated with lipid levels. We characterized 567 regulatory regions linked to these complex traits that are enriched among adipose tissue enhancer elements, thus demonstrating the value of interrogating these regulatory regions in studies of complex diseases. MCC-Seq can profile every base in the captured regions, allowing us to present novel localization patterns of epigenomic variants linked to complex traits. In addition, the method makes it possible to fine-map epigenomic signals identified by earlier studies (~700 adipose tissue samples) that used low-density coverage profiling techniques. Through our studies, we generated unique reference epigenomic maps detailing the methylome, transcriptome and chromatin accessibility profiles of VAT and isolated adipocytes. Incorporating these epigenomic maps into our analyses allowed us to characterize our regulatory regions and to identify putative pleiotropic effects linking them to gene expression. We show that a large majority of lipid-associated regulatory regions are under genetic control (>55%) and that this proportion was strengthened (>93%) at sites replicated in bioavailable tissues (e.g. whole blood). Furthermore, we observe an enrichment of genetic effects by GWAS variants linked to complex metabolic traits, including GALNT2. In summary, the results presented in this thesis demonstrate the advantages of our method for the comprehensive evaluation of genetic and epigenetic variation in biologically relevant tissues and their impact on the pathology of complex diseases.

List of Abbreviations

ABCC5 ATP binding cassette subfamily C member 5

ABCG1 ATP binding cassette subfamily G member 1

ABCG5 ATP binding cassette subfamily G member 5

ABCG8 ATP binding cassette subfamily G member 8

AT Adipose tissue

ATAC-Seq Assay for transposase-accessible chromatin sequencing

AKT1 AKT serine/threonine kinase 1

BDNF Brain derived neurotrophic factor

BMI Body mass index

BMP4 Bone morphogenetic protein 4

BS Bisulfite sequencing

CD7 CD7 molecule

CD36 CD36 molecule

C/EBPa CCAAT enhancer binding protein alpha

CERK Ceramide kinase

ChIP-Seq Chromatin immunoprecipitation sequencing

CIHR Canadian Institutes of Health Research

CGI CpG island

CpG 5’– cytosine – phosphate – guanine – 3’

CPT1A Carnitine palmitoyltransferase 1A

CSK C-terminal Src kinase

DNA Deoxyribonucleic acid

DNaseI-Seq DNase I hypersensitive sites sequencing

DNMT1 DNA methyltransferase 1

DNMT3A DNA methyltransferase 3 alpha

DNMT3B DNA methyltransferase 3 beta

ECHS1 Enoyl-CoA hydratase, short chain 1

ENCODE Encyclopedia of DNA Elements

eQTL Expression-linked quantitative trait loci

EWAS Epigenome-wide association study

FDR False discovery rate

FRSQ Fonds de la recherche en santé du Québec

FTO Fat mass and obesity-associated protein

GALNT2 Polypeptide N-acetylgalactosaminyltransferase 2

GDF7 Growth differentiation factor 7

GEO Gene Expression Omnibus

GLM Generalized linear model

GNA15 G protein subunit alpha 15

GNG7 G protein subunit gamma 7

GTEx Genotype-Tissue Expression consortium

GWAS Genome-wide association study

HDAC4 Histone deacetylase 4

HDL-C Plasma high-density lipoprotein

HGP Human Genome Project

HIF3A Hypoxia inducible factor 3 subunit alpha

hQTL Histone quantitative trait loci

IDH2 Isocitrate dehydrogenase 2

IFG Impaired fasting glucose

IHEC International Human Epigenome Consortium

INAF Institute of Nutrition and Functional Foods

IUCPQ University Institute of Cardiology and Respirology of Quebec

LCL Lymphoblastoid cell lines

LCN2 Lipocalin 2

LD Linkage disequilibrium

LDL-C Plasma low-density lipoprotein

LMR Low-methylated region

MCC-Seq MethylC-capture sequencing

MC4R Melanocortin 4 receptor

MetS Metabolic syndrome

MetV1 Metabolic-specific MCC-Seq target panel generation 1

MetV2 Metabolic-specific MCC-Seq target panel generation 2

metQTL/mQTL Methylation-linked quantitative trait loci

MHC Major histocompatibility complex

MKNK2 MAP kinase interacting serine/threonine kinase 2

MuTHER Multiple Tissue Human Expression Resource

NFIB Nuclear factor I B

NGS Next-generation sequencing

NHGRI National Human Genome Research Institute

NHS National Health Service

NIH National Institutes of Health

NIHR National Institute for Health Research

PPARg Peroxisome proliferator-activated receptor gamma

QTL Quantitative trait loci

REEP6 Receptor accessory protein 6

RNA Ribonucleic acid

RNA-Seq RNA sequencing

RPTOR Regulatory associated protein of MTOR complex 1

RRBS Reduced representation bisulfite sequencing

RUNX1 Runt related transcription factor 1

SD Standard deviation

SLCO3A1 Solute carrier organic anion transporter family member 3A1

SNP Single-nucleotide polymorphism

SPPL2B Signal peptide peptidase like 2B

SREBF1 Sterol regulatory element binding transcription factor 1

STAT Signal transducer and activator of transcription

STAT1 Signal transducer and activator of transcription 1

STAT3 Signal transducer and activator of transcription 3

STAT5A Signal transducer and activator of transcription 5A

TC Total plasma cholesterol

TET Ten eleven translocation

TF Transcription factor

TFBS Transcription factor binding site

TG Triglyceride

TSS Transcription start site

T2D Type 2 diabetes

UCSC University of California Santa Cruz

UCSF University of California, San Francisco

UMR Un-methylated region

VAT Visceral adipose tissue

VGLL3 Vestigial like family member 3

WGBS Whole-genome bisulfite sequencing

List of Tables

CHAPTER 2: CHARACTERIZATION OF METHYLOMES BY NEXT-GENERATION CAPTURE SEQUENCING

2.10.1 Tables

Table 1. Composition of MetV1 and MetV2 Panels

2.11.1 Supplementary Tables

Supplementary Table 1. Sequence statistics of the MetV1 pooled samples
Supplementary Table 2. Sequence statistics of the MetV2 pooled samples
Supplementary Table 3. Comparison of MCC-Seq methylation calls with Illumina 450K array and WGBS data at various read depths
Supplementary Table 4. Comparison of MCC-Seq methylation calls with Illumina 450K array and WGBS data at various read depths excluding completely hypo- and hypermethylated CpGs
Supplementary Table 5. MuTHER replication and cis-mQTL regulation of top TG-associated CpGs
Supplementary Table 6. ATAC-Seq PCR amplification primers
Supplementary Table 7. ATAC-Seq Q-PCR primers

CHAPTER 3: DISSECTING FEATURES OF EPIGENETIC VARIANTS UNDERLYING CARDIOMETABOLIC RISK

3.11.1 Tables

Table 1. Genetic regulation on lipid-linked adipose regulatory regions

3.12.1 Supplementary Tables

Supplementary Table 1. Characteristics of the study cohorts
Supplementary Table 2. Size and CpG density characterization of adipose regulatory regions
Supplementary Table 3. Overlap between discovery CpGs versus those on the EPIC and 450K array at adipose regulatory regions
Supplementary Table 4. Transcription factor binding site motifs at regions flanking replicated MuTHER lipid-CpGs mapping to UMRs
Supplementary Table 5. Top canonical pathways for genes modulated by replicated lipid-linked regulatory regions and further linked to the same circulating lipid traits
Supplementary Table 6. Top canonical pathways for genes overlapping lipid-linked regulatory regions replicating in whole blood

List of Figures

CHAPTER 2: CHARACTERIZATION OF METHYLOMES BY NEXT-GENERATION CAPTURE SEQUENCING

2.10.2 Figures

Figure 1. Technical replication of MCC-Seq methylation calls and comparison with WGBS
Figure 2. Comparison of methylation calls obtained with different methods
Figure 3. Annotation of triglycerides (TG)-associated CpGs in putative regulatory regions
Figure 4. Top TG-associated CpG mapping to an AT-specific regulatory region – CD36

2.11.2 Supplementary Figures

Supplementary Figure 1. Extended comparison of MCC-Seq methylation calls with WGBS and the Illumina 450K array excluding completely hypo- and hypermethylated CpGs
Supplementary Figure 2. Correlation between Illumina 450K array and MCC-Seq methylation calls at different read coverage
Supplementary Figure 3. Comparison of MCC-Seq methylation calls with Agilent SureSelect
Supplementary Figure 4. Distributions of sequence coverage at included CpG sites
Supplementary Figure 5. Outline of the trait association and population-based validation studies
Supplementary Figure 6. Average methylation pattern of CpGs captured with MCC-Seq MetV1 design
Supplementary Figure 7. Characterization of adipose hypomethylated footprints
Supplementary Figure 8. Variability of enhancer and promoter associated CpG sites
Supplementary Figure 9. CpG-by-CpG correlation between Illumina 450K array and MCC-Seq methylation calls in 24 samples
Supplementary Figure 10. Comparison of the observed heterozygosity from MCC-Seq and HumanOmni BeadChip array genotyping calls
Supplementary Figure 11. Distribution of triglyceride levels in the discovery cohort

CHAPTER 3: DISSECTING FEATURES OF EPIGENETIC VARIANTS UNDERLYING CARDIOMETABOLIC RISK

3.11.2 Figures

Figure 1. Study flow chart
Figure 2. Positional mapping of lipid-CpGs within adipose tissue regulatory elements
Figure 3. TG-linked adipose-specific regulatory region shows putative pleiotropic effects
Figure 4. HDL-C linked adipose-specific regulatory region under genetic regulation

3.12.2 Supplementary Figures

Supplementary Figure 1. QQplots for EWAS of TG to methylation associations before and after correction
Supplementary Figure 2. QQplots for EWAS of HDL to methylation associations before and after correction
Supplementary Figure 3. QQplots for EWAS of LDL to methylation associations before and after correction
Supplementary Figure 4. QQplots for EWAS of TC to methylation associations before and after correction
Supplementary Figure 5. Significant associations between methylation and lipid phenotypes in the discovery cohort
Supplementary Figure 6. Methylation range variance across CpGs within the discovery cohort
Supplementary Figure 7. Annotation of lipid-CpGs among adipose tissue regulatory elements
Supplementary Figure 8. Mean CpG coverage across adipose tissue regulatory elements
Supplementary Figure 9. Positional mapping of CpGs overlaying Illumina 450K and EPIC array probes
Supplementary Figure 10. Expression profile of STAT5A across the multiple tissues in GTEx
Supplementary Figure 11. Expression profile of STAT3 across the multiple tissues in GTEx
Supplementary Figure 12. Expression profile of NFIB across the multiple tissues in GTEx
Supplementary Figure 13. Genomic distance between CpGs and their top associated SNP

Acknowledgments

This doctoral work would not have been possible without the support, knowledge and kindness of many individuals along the journey.

To my supervisor Dr. Elin Grundberg. I will always be grateful for the trust you placed in me to carry these projects to completion. I thank you for the opportunity to work in this cutting-edge field. It’s been a fast-paced ride but I wouldn’t change a thing! I further thank you for your availability and advice every step of the way. You are an inspiration for women striving to make an impact in our field.

To my supervisory committee members. Thank you, Drs. Jamie Engert and Mathieu Blanchette, for your insightful scientific input throughout my projects. I am particularly grateful to Jamie for stepping in as co-supervisor after Elin pursued exciting opportunities in Kansas City.

To the funding agencies that supported my projects. I would like to thank the Fonds de la recherche en santé du Québec (FRSQ) and the McGill University Health Centre for funding my projects throughout my PhD. I would further like to thank the various agencies that provided me with travel opportunities to present my work at national and international conferences: the Canadian Epigenetics, Environment and Health Research Consortium, Keystone Symposia and the McGill Human Genetics Department.

To my Grundberg lab colleagues. A big thank-you to Marie-Michelle, Élodie and Albena for your support over the years. Additional thanks to Jinchu for her kindness and for being my travel roommate. I would also like to thank Xiaojian, Bing, and Warren for their patience and assistance through my steep learning curve of the bioinformatics world.

To Tony. A lot of gratitude to you, my friend, for the many science (but mostly life) conversations. Your support and kindness were instrumental throughout this process. I thank you for all your encouraging words.

To my sixth-floor admin office colleagues. Thank you to Dan, Markus, Guillaume, Tanya, Ksenia, Lindsay, Jasmine and Alfredo for their company and chats during much needed coffee breaks.

To the Genome Centre. Thank you for the opportunity to work in a stimulating environment. It was a pleasure to organize journal clubs and Holiday parties for such a great group of individuals.

To the Human Genetics Department. I would like to thank Ross McKay, Dr. Aimee Ryan, and Dr. Anna Naumova for their great patience in answering my many enquiries at various points of my journey. An extra thanks to Ross for the great yearly Research Day and BBQ events that were always a joy to attend. I am also grateful to have completed my studies with a great cohort of students.

To my grad school pals – past and present. I would like to offer my thanks to Erika, LeeAnn, John, Eric, Nick, and Andréanne for their friendship and constant support throughout these years. You’ve made them memorable for all the good reasons.

To my mom. You are the strongest woman I know and a constant inspiration for me to keep reaching for my goals. Although we have vastly different interests in our careers, I am grateful for your constant support and for being my sounding board on life matters. Thank you for your unconditional love.

And to Dylan. Your thoughtfulness has kept me positive throughout the finalization of this thesis. Thank you for bringing lightness into my everyday life. The next chapter seems even brighter with you and Mini by my side.

Preface

Format of the Thesis

This doctoral thesis is presented in the manuscript-based format, following regulations provided by the Graduate and Postdoctoral Studies and the Human Genetics Departments of McGill University. This work comprises five chapters and the Appendix section. Chapter 1 presents a literature review to situate the reader within the context of the presented work. Chapters 2 and 3 are both manuscripts published in Nature Communications. Chapter 4 serves as a general discussion of the works presented in this thesis. Chapter 5 presents conclusions and future aims expanding from this study. Finally, a short description of contributions to other manuscripts during my doctoral studies is found in Appendix A.

Contribution of Authors

All presented works have been completed under the supervision of Dr. Elin Grundberg at the Department of Human Genetics of McGill University.

The manuscript presented in Chapter 2 was published in Nature Communications in May 2015 and is authored by Fiona Allum, Xiaojian Shao, Frédéric Guénard, Marie-Michelle Simon, Stephan Busche, Maxime Caron, John Lambourne, Julie Lessard, Karolina Tandre, Åsa K Hedman, Tony Kwan, Bing Ge, The Multiple Tissue Human Expression Resource Consortium, Lars Rönnblom, Mark I McCarthy, Panos Deloukas, Todd Richmond, Daniel Burgess, Timothy D Spector, André Tchernof, Simon Marceau, Mark Lathrop, Marie-Claude Vohl, Tomi Pastinen, and Elin Grundberg. E.G. and M.L. conceived the study. T.P. provided conceptual ideas for the study. E.G., T.P., T.R., D.B., M.C.V. and A.T. designed experiments. F.A., X.S. and S.B. analysed data. F.A., M.M.S., F.G., J. Lambourne and J. Lessard performed experiments. T.K., B.G. and M.C. provided bioinformatics support. S.M., F.G., J. Lessard, K.T., A.T., L.R., T.D.S. and M.C.V. collected, prepared and/or provided the clinical samples. A.K.H., M.M. and P.D. provided replication data. F.A., M.L. and E.G. drafted the manuscript. All authors reviewed and contributed feedback on the final manuscript. Specifically, I worked in the laboratory with M.M.S. to implement the MCC-Seq protocols (2.5.1, 2.7.3). I generated the input sequences for the second-generation adipose-specific panel design (2.5.2, 2.7.2, 2.7.9). I performed all downstream analyses for the interpretation of the triglyceride-linked epigenome-wide association study (EWAS) results (2.5.6, 2.5.7, 2.5.8). I performed the RNA-Seq differential analyses (2.7.15). I generated all main and supplementary figures and tables relating to these sections (2.10, 2.11). I drafted all sections of the manuscript with major edits from E.G. and M.L.

Chapter 3 is a manuscript published in Nature Communications in March 2019 and is authored by Fiona Allum, Åsa K Hedman, Xiaojian Shao, Warren A Cheung, Jinchu Vijay, Frédéric Guénard, Tony Kwan, Marie-Michelle Simon, Bing Ge, Cristiano Moura, Elodie Boulier, Lars Rönnblom, Sasha Bernatsky, Mark Lathrop, Mark I McCarthy, Panos Deloukas, André Tchernof, Tomi Pastinen, Marie-Claude Vohl, and Elin Grundberg. E.G. conceived the study. E.G., T.P., M.C.V., A.T. and M.L. designed experiments. F.G., L.R., A.T. and M.C.V. collected, prepared and/or provided the clinical samples. E.G. and F.A. led data analyses. F.A., M.M.S. and E.B. performed experiments. F.A., A.K.H., W.A.C., X.S. and J.V. analyzed data. T.K. and B.G. provided bioinformatics support. A.K.H., M.I.M., S.B., C.M. and P.D. provided replication data. F.A. generated figures with contributions from W.A.C. F.A. and E.G. drafted the manuscript. All authors reviewed and contributed feedback on the final manuscript. Specifically, I collaborated with M.M.S. in the laboratory to generate the MCC-Seq data for the Quebec Heart and Lung Institute (IUCPQ) cohort (3.7.2). I carried out the EWAS of plasma lipid levels in the IUCPQ cohort (3.5.1, 3.7.3) and all downstream analyses to dissect these findings (3.5.2, 3.5.3, 3.5.4, 3.5.5, 3.5.6, 3.5.7, 3.7.5, 3.7.7, 3.7.8, 3.7.9, 3.7.10). I generated all figures and tables in the manuscript (3.11, 3.12). Finally, I wrote the manuscript with main edits from E.G.

Original Contribution to Knowledge

This doctoral work contributed a novel sequencing method to aid our understanding of genetic and epigenetic variants underlying complex traits, with a focus on metabolic disease-related traits. We used integrational approaches that incorporate ‘omics’ datasets from in-house and publicly available consortia projects to fine-map identified lipid-linked variants. We focused our efforts on disease-relevant tissues and cells and presented an expanded catalogue of epigenetic variants linked to metabolic traits.

The first study described in Chapter 2 is entitled “Characterization of functional methylomes by next-generation capture sequencing identifies novel disease- associated variants”. This work presents our implementation of the methylC-capture sequencing (MCC-Seq) method, which permits cost-effective and simultaneous single- base resolution methylation and genotype profiling over custom target regions (up to

200Mb). Utilizing adipose tissue as a model, two custom panels that aimed at capturing regulatory regions were designed, with the latter of these panels being subsequently licensed by Roche NimbleGen (SeqCap Epi Developer XL Design

#131010_HG19_EG_met_EPI). Through technical and methodological comparisons,

MCC-Seq was established to be as accurate as alternative approaches both in terms of methylation and genotyping profiling. We then performed a proof-of-concept epigenome-wide association study (EWAS) linking blood lipid levels to methylation status by applying MCC-Seq in a cohort of visceral adipose tissue (VAT) from 72 obese donors (Quebec Heart and Lung Institute; IUCPQ). Applying this technique, we


contributed high resolution maps of methylation signatures in regulatory elements of VAT. Within this work, we further generated the first chromatin accessibility landscape (Assay for transposase-accessible chromatin sequencing; ATAC-Seq) and transcriptomic profiles for metabolic disease-relevant cells (i.e. purified depot-specific adipocytes). We utilized these and other tissue-specific ‘omics’ layers (i.e. NIH

Roadmap Epigenomics Consortium; Roadmap) to characterize identified complex trait-linked epigenetic variants enriched in tissue-specific enhancer regions, highlighting metabolically-linked CD36 as our top locus. This study led to a collaboration with the Abumrad group at Washington University (Appendix A) where we used datasets from Chapter 2 to further functionally interpret lipid trait-associated CD36 SNPs identified within the GOLDN cohort (Genetics of Lipid

Lowering Drugs and Diet Network study; N= 1,117).

The work depicted in Chapter 3 is titled “Dissecting features of epigenetic variants underlying cardiometabolic risk using full-resolution epigenome profiling in regulatory elements”. Here, we present a large-scale study, applying the method summarized in Chapter 2, that investigates methylation to plasma lipid associations in VAT of ~200 obese individuals and matched whole blood samples, representing the largest available VAT resource to date. Profiling with MCC-Seq protocols permitted us to interrogate an unprecedented ~1.3M dynamic CpGs within tissue-specific regulatory regions. The single-base mapping resolution at these elements enabled us to generate novel positional maps of complex trait-linked epigenetic variants. Our study design permitted us to investigate features of lipid-linked regulatory regions


replicating across tissues from a disease-linked tissue (i.e. VAT) to a bioavailable tissue (i.e. whole blood), whereas most studies have presented the inverse scheme.

We further used large population-based cohorts of adipose and whole blood tissue samples (Multiple Tissue Human Expression Resource [MuTHER] and CARTaGENE consortia; N~800 total) to replicate our findings from a disease- to a population-based cohort. We dissected the impact of genetic effects at adipose regulatory regions and showed enrichment in regulation by publicly available genome-wide association studies (GWAS) single-nucleotide polymorphisms (SNPs) for the same lipid traits. We performed detailed follow-up on one such adipose regulatory region intragenic to

GALNT2 under regulation by an HDL-linked GWAS locus. We further presented an expanded dataset of methylation to gene expression associations in adipose tissue, spanning +/-1 Mb from investigated regulatory regions. Through integrational and replication studies, we highlighted key regulatory regions under non-genetic regulation that exhibit putative pleiotropic effects on expression levels of non-proximal genes. In all, we catalogued over 550 previously unreported regulatory regions linked to cardiometabolic risk, exemplifying the discovery value of using disease-linked tissues in epigenome-wide studies.


Chapter 1: Introduction

1.1 Studying Complex Traits: Focus on Metabolic Diseases

Deciphering the etiology of common diseases remains an ongoing focus for research in human populations1. Examples of complex traits include height, asthma, obesity and other related metabolically-linked co-morbidities such as type 2 diabetes

(T2D). While monogenic (i.e. Mendelian) phenotypes can be attributed to a single locus of high penetrance with minimal contributions from other modulating loci and environmental factors, the study of complex traits poses a more interesting challenge, as they are known to result from the interplay between different genetic loci with small effect sizes2-8 and other external factors. This thesis focused on elucidating the genetic and epigenetic factors underlying obesity and related common traits.

1.1.1 Pathophysiology and Socioeconomic Burden

Obesity is an important health concern. Over one third of the global population is considered to be obese with current estimates on an upward trajectory9.

Categorization as obese or overweight is based on body mass index (BMI) cutoffs of ≥30 kg/m² and ≥25 kg/m², respectively. Essentially, this disease is characterized by an imbalance in energy intake versus energy usage and storage resulting in fat accumulation10. Obesity often falls under the umbrella of metabolic syndrome where clinical detection is determined by the incidence of two out of four diagnoses in addition to abdominal obesity: elevated triglycerides (TG), low HDL-C, elevated fasting glucose (impaired fasting glycemia or T2D), and/or elevated blood pressure11.
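As a minimal illustration of the cutoffs and the diagnostic rule described above, the following Python sketch classifies a BMI value and applies the "abdominal obesity plus two of four diagnoses" criterion. The function names and boolean inputs are hypothetical, the rule is read here as "at least two", and the clinical thresholds defining each individual diagnosis (e.g. what counts as elevated TG) are not given in the text and are therefore left to the caller.

```python
def bmi(weight_kg, height_m):
    """Body mass index in kg/m^2."""
    return weight_kg / height_m ** 2


def bmi_category(value):
    """Apply the cutoffs quoted in the text: >=30 obese, >=25 overweight."""
    if value >= 30:
        return "obese"
    if value >= 25:
        return "overweight"
    return "below overweight cutoff"


def meets_metabolic_syndrome(abdominal_obesity, elevated_tg, low_hdl_c,
                             elevated_fasting_glucose, elevated_blood_pressure):
    """Abdominal obesity plus at least two of the four other diagnoses.

    Assumption: "two out of four" is interpreted as "two or more".
    """
    n_additional = sum([bool(elevated_tg), bool(low_hdl_c),
                        bool(elevated_fasting_glucose),
                        bool(elevated_blood_pressure)])
    return bool(abdominal_obesity) and n_additional >= 2


# Hypothetical individual: 95 kg, 1.75 m, abdominal obesity,
# elevated TG and low HDL-C.
category = bmi_category(bmi(95, 1.75))
syndrome = meets_metabolic_syndrome(True, True, True, False, False)
```

Note that two individuals can fall in the same BMI category while differing in every input to the second function, which is the heterogeneity that motivates the use of intermediate traits.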


Obesity is also a heterogeneous disease in terms of fat tissue distribution (e.g. subcutaneous vs intraabdominal), amount of fat accumulation, and associated co-morbidities. Given the availability of BMI estimates from clinical data12, many studies have used this value as a proxy of obesity status with some success. However,

BMI values do not necessarily reflect the heterogeneous features of this disease.

Similar BMI status can be calculated for individuals with vastly different metabolic risk profiles in terms of fat distribution and plasma lipid reports. For instance, we know that enlarged intraabdominal visceral fat depots are specifically linked to worse health outcomes10,13. Recently, more success has been found when focusing on obesity-linked intermediate traits such as lipid levels, which we concentrate on in this work.

1.1.2 Environmental and Genetic Factors Contributing to Disease

Obesity is known to be caused by joint action of genetic and environmental factors wherein exposure to environmental factors in the context of a pre-disposing genetic background leads to higher disease incidence14. Estimates of genetic contribution to complex traits are varied. Although the projected impact of genetics on the variance of obesity-related traits such as BMI was originally estimated from twin studies to be

40-70%12,15,16, other more recent studies applying new statistical tools point towards estimates closer to 30-40%17. Deciphering the genetic component in complex diseases involves the identification of single nucleotide polymorphisms (SNPs) linked to relevant quantitative traits. On the other hand, investigating environmental factors underlying complex traits in human populations is challenging due to a lack of

controlled experimental conditions, which prevents us from teasing out the impact of these effects individually. Certain behavioral patterns have also been linked to the onset of obesity, including: (1) overeating, (2) diets high in processed foods, (3) sedentary lifestyle, (4) lack of sleep, (5) certain medications, among others10.

1.2 Genome-wide Association Studies

The types of genetic studies used to investigate the causal genes contributing to complex traits have evolved with the advent of technological advances in the field.

Three clear waves of studies can be outlined18: (1) family-based linkage analyses permitted the identification of genes linked to more extreme early-onset Mendelian forms of the studied phenotypes, underlying possible pathways for complex forms of the disease, (2) association tests of candidate genes allowed for the informed investigation of the contribution of genes in suspected common trait-linked pathways to disease etiology, and (3) genome-wide association studies, as of the mid-2000s with the development of single-base sequencing technologies, enabled unbiased investigations of complex trait-linked genetic variants on a genome-wide scale. We focus our review on the last and most prominent of these methodologies.

1.2.1 GWAS of BMI, fat distribution and lipids

Due to the heterogeneous nature of metabolic syndrome and obesity, studying this condition as a whole has yielded few results, and these lack cohesion18. As such, the main complex trait studied in GWAS of obesity has been BMI. Major findings of BMI-linked

genetic variants include those mapping to the FTO locus19 and other genes such as

BDNF and MC4R20, implicating a role for the central nervous system in obesity etiology10,18,20. Of note, the latest large-scale meta-analyses led by the GIANT consortium leveraged the BMI status for ~700,000 individuals and yielded 941 trait-linked SNPs that together explained ~6% of BMI variance21. Measures of body fat distribution such as waist-to-hip ratio (i.e. visceral vs subcutaneous proxy ratio) were also investigated to decipher the impact of genetic variants on obesity. GWAS SNPs associated to fat distribution traits were found to map to genes (>50 loci) enriched in adipogenesis function18,22. Similarly to GWAS of BMI, the identified fat distribution-associated GWAS SNPs account for less than 5% of individual trait variance10. More success has been found in GWAS of lipid traits2-8. These studies have been conducted mainly in the context of cardiovascular diseases, revealing ~370 loci enriched in pathways linked to metabolic phenotypes. Cumulatively, these GWAS SNPs explain under ~15% of the phenotypic variance per trait. Lipid profiles are used in the diagnosis of metabolic risk and obesity; therefore, investigating their functional impact in disease-relevant tissues such as adipose tissue is needed.

1.2.2 Limitations of GWAS

GWAS have been successful in identifying thousands of common variants associated with complex traits23, which together have underlined potential biological pathways involved in disease onset and progression. However, some caveats of these studies should be highlighted. Genetic variants identified through GWAS lie within genomic

blocks that exhibit linkage disequilibrium (LD), adding a layer of complexity in pinpointing the causal variants contributing to disease risk24. Another important observation has been the enrichment in genomic distribution of disease-linked SNPs to non-coding regions18,25 as opposed to exonic regions where downstream effects on gene products (e.g. amino acid change) can more easily be attributed. Specifically,

GWAS SNPs linked to complex disease phenotypes have been found to be enriched in regulatory regions of trait-linked tissues and cells25-28. For instance, GWAS SNPs of immune-related phenotypes such as type 1 diabetes and allergies have been reported to be enriched in active regulatory regions of immune cells27. Additionally, the multifactorial disease-linked GWAS variants identified so far were noted to contribute only a modest proportion to disease heritability24. Taken together, these factors make translation of GWAS findings into biological knowledge very challenging12,18.

1.2.3 Missing Heritability in Common Diseases

Statistical power in GWAS to identify complex disease-linked genetic variants is directly related to cohort size and allele frequency. To date, GWAS of circulating lipids2-8 have cumulatively identified ~370 loci that together explain less than ~15% of the lipid-traits variance. This estimate takes into account the latest GWAS release where 118 novel loci were identified in a large cohort of ~300,000 multi-ethnic individuals, contributing less than 1% phenotypic variance per trait4. Although

GWAS of increasing cohort size persistently uncover novel phenotype-linked loci, a

significant fraction of common disease heritability remains unsolved. Different genetic and environmental components have been suggested to explain this “missing” heritability in complex traits.
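To make the dependence on allele frequency and effect size concrete, a standard quantitative-genetics approximation (not taken from the cited studies) gives the fraction of trait variance explained by a single biallelic SNP as 2p(1-p)β², assuming an additive model, Hardy-Weinberg equilibrium, and a per-allele effect β expressed in trait standard deviations, with p the minor allele frequency. The Python sketch below applies this approximation to hypothetical values; the function name and the numbers are illustrative only.

```python
def variance_explained(maf, beta_sd):
    """Fraction of trait variance explained by one additive biallelic SNP.

    Assumptions: Hardy-Weinberg equilibrium, additive effect, and
    beta_sd given in units of trait standard deviations.
    """
    if not 0.0 <= maf <= 0.5:
        raise ValueError("minor allele frequency must lie in [0, 0.5]")
    return 2.0 * maf * (1.0 - maf) * beta_sd ** 2


# Hypothetical SNPs with the same per-allele effect (0.05 SD)
# but different minor allele frequencies.
common = variance_explained(0.30, 0.05)
rare = variance_explained(0.01, 0.05)
```

For an equal per-allele effect, the rarer variant explains far less population-level variance than the common one, which is one reason why loci of small effect or low frequency require very large cohorts to detect.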

GWAS are conducted using array technologies that mostly target common SNPs with minor allele frequencies above 5% as tag SNPs. Due to this technical bias, GWAS do not allow for full assessment of rare variants29 that may be contributing to complex traits. Direct sequencing of these variants30 is needed followed by adapted statistical tests to account for the low frequency of these SNPs in populations29. Similarly, very few structural variants (e.g. copy-number variations) are currently included in genotyping arrays. Another confounding factor in accurately estimating the contribution of variants in complex traits can be attributed to possible gene-gene interactions31 between already phenotype-linked loci. Similarly, gene-environment interactions are known to be impactful in common diseases but are problematic to assess in human studies. Fine-mapping efforts to investigate these additional sources of genetic contribution are currently underway.

1.2.4 Integrational Studies to Interpret Genetic Variants

Integrational approaches can be used in order to help infer biological functions underlying identified complex trait-linked loci from GWAS. A main strategy employed is to functionally annotate variants through additional association studies of cellular traits in disease-relevant tissues32. The assumption using this strategy is to aid the interpretation of the downstream effects of the disease-linked genetic variants, which

as mentioned above, can be hard to identify due to their enrichment in non-coding

DNA. Cellular traits investigated for these purposes are diverse but mainly focus on gene expression levels of cis-located genes (expression quantitative trait loci; eQTL) or epigenetic traits such as DNA methylation levels (methylation QTL; metQTL; mQTL) or chromatin marks’ peaks (histone QTL; hQTL). These types of association studies are discussed in more detail in the sections below.

1.2.5 Expression Quantitative Trait Loci Studies

The first wave of integrational studies to assign putative function to GWAS SNPs involved linking these genetic variants with expression levels of cis-locating genes in single tissue or cell samples33-35. Analyzing eQTL signals across tissues of populations provided additional insight into the tissue-specific and cell lineage-specific nature of

SNP-expression correlations, indicating the importance of tissue selection in disease dissection36-38. Recently, large-scale efforts from the Genotype-Tissue Expression

(GTEx) consortium have reinforced this latter point as well as provided publicly available eQTL data across >50 tissues39-41. Importantly, such studies have the ability to point out expression-linked SNPs adjacent but in LD with GWAS SNPs, providing a fine-mapping tool for the identification of causal SNPs acting on the associated complex trait. However, evidence exists for the effect of cellular epigenetic traits on gene expression not mediated by genetic effects, which indicates that integrational epigenetic studies are needed38. The strategy employed in eQTL studies has been applied to other cellular traits discussed below.


1.3 Epigenetics to Link Environment and Genetics to Disease Phenotypes

Epigenetic marks are known to be variable across cell types and individuals with genetic and environmental factors mainly impacting their status. In this way, studying epigenomes can help provide better insight into disease biology by linking genetics and environment to disease and phenotype. As previously mentioned, epigenetic marks can be used to link complex trait-GWAS SNPs to function.

Additionally, investigating epigenetic cellular traits also permits the detection of epigenetic-environment interactional effects, the prevalence of which is currently unknown. Assessing the impact of epigenetics on chromatin accessibility and complex traits is still underway.

1.3.1 What is Epigenetics?

Epigenetics was traditionally coined to signify an umbrella of unexplainable non-genetic but genetic-related biological events42,43. Since then, this term has evolved to define a fast-paced subfield of biology that studies stable and reversible marks

“above” (i.e. epi) the DNA level (i.e. genetics) that contribute to gene expression patterns within a cell42,44. Although the genome of an individual is stable across cell types, obvious functional differences exist between these that can be attributed to variances in transcriptomes directed by epigenomes. The mechanisms through which epigenetic marks are modified and influence gene expression levels are still being elucidated. Commonly studied epigenetic traits include methylation of CpG dinucleotides, chemical modifications of histone protein tails and non-coding RNA

species42 that work together in concert to direct phenotypic changes. Epigenetic changes have been shown to be altered by both genetic and environmental contributions. Studies in monozygotic twin pairs (i.e. identical genetic background) have revealed that subtle epigenetic changes acquired through their lifespan result in downstream alterations of gene expression – with this epigenetic drift being emphasized in cases of vastly differential environmental exposure (e.g. smoking status)45. These differences in epigenomic profiles have been suggested to contribute to discordant disease status noted among certain monozygotic twin pairs45. Most studies of epigenetic marks have been conducted in bioavailable tissues making extrapolation of their full contribution to disease status difficult.

1.3.2 Reference Epigenome Efforts

The Human Genome Project (HGP) was a landmark endeavor that propelled the field of genetics into genomics. This project ultimately provided the scientific community with public access to the 3 billion base pairs (bp) of the human genome sequence – the largest genome sequenced at the time of release46,47. HGP set a precedent for collaborative efforts to promote scientific and technological advancements, leading to the establishment of other forward-thinking large-scale projects47.

An important finding of the HGP was that a majority of the genome is in fact non-coding DNA – with coding genes representing less than 3% of base pairs. This observation paved the way for the field of functional genomics47 to help interpret the role of genomic regions. Due to the cell-type and developmental-stage specific nature

of regional genomic activity, projects in this field pose added challenges to capture the full depth of genomic functions. The Encyclopedia of DNA Elements (ENCODE) project was launched in 2003 by the National Human Genome Research Institute

(NHGRI)48 to catalog all functional elements of the genome with a focus on cultured cell lines49. The Roadmap Consortium was implemented in 200850 with the aim to generate complete epigenomes in ex-vivo disease-linked tissues and stem cells that culminated in the release of 111 reference epigenomes as of 201527. Different complementary mapping projects have been launched to fill in gaps in targeted cells/tissues including the successful BLUEPRINT epigenome project51,52, which centers its efforts on hematopoietic cells of healthy and diseased individuals. As of

2010, these consortiums and others fall under the umbrella of the International

Human Epigenome Consortium (IHEC) that aims to generate 1000 full epigenome maps of human cells and tissues53. Together, these collaborative efforts serve to provide epigenomes to the scientific community to help decipher the genetic basis of disease and regulation of gene expression. As with the HGP, these consortiums also serve to push forward technological and statistical methods to discover and interpret epigenomic marks.

1.3.3 Defining Regulatory Elements

Epigenome profiles can be used to inform genome function. Promoters are regulatory regions that lie at the 5’ end of coding sequences and that operate as the starting position of transcription. The efficiency of transcription is known to be modulated by

distal regulatory elements through protein (e.g. transcription factors) interactions that physically bring these genomic regions into contact. Chromatin accessibility is a crucial factor in gene expression. Assessing epigenome patterns genome-wide has provided us with a wealth of information concerning the relationship of epigenetic traits to regulatory regions and gene expression as well as the crosstalk between these maps25,27. In this section, we focus on the first point and we review the relationship between these traits as well as gene expression in the next section.

Evaluating chromatin accessibility has been informative in defining active regulatory regions. One of the more successful genome-scale methods employed DNase I (DNase I hypersensitivity site using sequencing; DNaseI-Seq) to reveal unoccupied (i.e. nucleosome-free) DNA regions - dubbed DNase I hypersensitive sites given their predisposition for degradation by this enzyme54. A first large-scale profiling effort was released under ENCODE for 125 cell and tissue types55. These footprint profiles revealed that promoter regions (i.e. overlapping transcription start sites; TSS) are mostly shared across sample types whereas distal elements are more cell-specific.

Further investigation of these profiles in combination with footprints from Roadmap showed distinctive enrichment of GWAS SNPs within DNase I hypersensitivity sites in disease-relevant tissues56, indicating the importance of identifying cell-specific regulatory elements. The release of protocols for ATAC-Seq in 2013 permitted the investigation of chromatin accessibility in additional cell and tissue types that were previously unattainable due to the high cell number requirements of DNaseI-Seq.


Another type of epigenetic mark that is informative for the functional annotation of the genome is the set of post-translational modifications (e.g. acetylation, methylation, etc.) of histones, the proteins that bind together with DNA to make up nucleosomes. Studies have shown coordination of histone mark positioning that points to a histone code linked to genome function57, which is not yet fully understood.

Some distinguishing features of genomic elements include the localization of

H3K4me3 marks at the TSS of active genes58 and H3K4me1 at enhancers59. In 2010,

Ernst and Kellis26 released an unsupervised statistical model that aimed to define chromatin states through combinations of histone marks using human T-cells as a model. Through these efforts, they were able to identify 51 individual chromatin states with distinguishable features and histone patterns26, contributing to our understanding of epigenetic marks in gene expression regulation.

Methylation at CpGs can also be used to infer chromatin activity25,55.

Hypomethylated footprints have been shown to correlate with transcription factor occupancy and, by extension, with active chromatin regions60. Algorithms have been developed to detect unmethylated (UMR; putative promoters) and low-methylated regions (LMR; putative enhancers) on a genome-wide scale61. Given the focus on DNA methylation in this presented work, the sections below provide additional background on this topic.


1.3.4 Coordinated Patterns of Epigenetic Regulation

Although studying epigenetic marks individually reveals key correlations to chromatin accessibility and gene expression levels, integrative analyses of these maps, mainly from large consortium efforts, provide a broader and more reliable picture of genome regulation1. Commonalities between these epigenomic traits, as well as to gene expression have been reported. For instance, DNA methylation status shows correlations to histone marks and gene expression62,63 with differences in DNA methylation profiles found to be more distinct near more strongly expressed genes27.

Similarly, particular combinations of histone marks were shown to overlap specific patterns of DNA methylation and chromatin accessibility25,27,64. For instance, active promoters (marked by H3K4me3) are known to be associated with hypomethylation of CpGs mapping within these elements. Enrichment of tissue-specific variable CpG sites within putative enhancer regions and co-localization with tissue-specific transcription factor binding sites were further observed65,66. Evidence for coordination across regulatory regions was also reported with correlations between peaks of chromatin accessibility detected, with results from chromatin conformation studies supporting these findings25. Additional studies are needed to decipher the mechanism of crosstalk between different layers of epigenetic regulation.


1.4 DNA methylation

DNA methylation is the most commonly studied epigenetic mark and the main cellular trait used to help decipher complex disease etiology in this doctoral thesis.

Biochemically, DNA methylation is characterized by the reversible addition of a methyl group (-CH3) to a cytosine residue located directly upstream of a guanine residue. Genome-wide, these CpG dinucleotides (i.e. methylome) correspond to ~30 million sites with potential for modification. Although cytosine bases in other sequence contexts have been shown to be methylated, they are not the main focus of the current study. CpGs are not distributed evenly throughout the genome. For instance, CpG islands (CGIs) correspond to regions of densely populated clusters of

CpGs that localize to the TSS of approximately half of all coding genes67. The addition of methyl groups to cytosine residues is catalyzed by DNA methyltransferases -

DNMT3A, DNMT3B and DNMT1. All three enzymes are involved in maintenance of

DNA methylation marks whereas the former two proteins show additional de novo transfer capabilities67,68. Studies have demonstrated that DNA methylation status at

CpGs is dynamic – primarily in regions overlapping regulatory elements25,65.

Reversal of the methyl group deposition in the CpG context involves several key players including ten-eleven translocation (TET) methylcytosine dioxygenases and thymine DNA glycosylase67,68.


1.4.1 Roles of DNA Methylation across the Genome

Deciphering the functional properties of DNA methylation in gene regulation is still underway. Emergent trends in the field have superseded the classical view for the role of DNA methylation in passive silencing through hypermethylation at CGIs. A broader stance is now supported where genomic location provides context for functional classification of CpG methylation67. Here, we present a brief summary of known functional roles of DNA methylation genome-wide. As noted above, hypermethylation of CGIs overlapping TSS has been directly correlated to long-term gene expression silencing, as exemplified through such processes as imprinting and X-inactivation67. Interestingly, hypermethylation of CGIs mapping to gene body regions does not result in terminated transcription elongation, indicating that this role is location-specific. As a whole, genic methylation at non-CGIs is still under investigation with suggested functions in gene splicing69 and silencing of intragenic promoters67,68,70. CpG hypermethylation has also been reported genome-wide at repetitive elements supporting a long-term silencing mechanism of transposable elements within these sequences67. Whereas hypermethylation at most genomic locations (except at gene bodies) supports a repressive role for this mark, hypomethylated regions have been suggested to represent the active and functional portion of the methylome. Comparative studies of methylomes across tissues, cell types and individuals have revealed that a majority of CpG methylation (~80%) is static genome-wide65. Evidence points towards informative marks being enriched in highly dynamic hypomethylated (non-CGI) regulatory elements60,65, which depict a

strong link to complex disease traits such as metabolic phenotypes. Although functional methylome landscapes in human tissues are not yet fully defined, studies have demonstrated that tissue-specific and disease-linked variants are enriched in active putative enhancer elements but depleted in promoter regions25,66.

1.4.2 Methods to Study DNA Methylation

This doctoral thesis presents a novel next-generation capture-based sequencing technique for methylation profiling that resolves some of the technical biases of currently available methods for genome-wide assessment of individual CpG methylation status. These techniques can be separated based on their methodology

(i.e. microarray or sequencing-based) and their genome coverage capability (i.e. biased versus unbiased). Most large-scale methylation assessment studies (e.g. metQTL and trait-EWAS) of complex traits to date have used targeted microarray technologies such as the Illumina Infinium HumanMethylation27 BeadChip (27K array) and, more prominently, the Infinium Human Methylation450 Beadchip array

(450K array)71 that interrogate ~27K and ~480K CpG sites, respectively. Briefly, these array techniques rely on the hybridization of bisulfite-treated DNA, which results in the conversion of unmethylated cytosine residues to thymine residues through a uracil intermediate, to probes followed by single-base extension and detection. Notably, these technologies are biased to CpG-dense static promoter regions65,66. Comparatively, whole-genome bisulfite sequencing (WGBS) represents an unbiased approach for the characterization of the full methylation landscape. In

this method, whole-genome sequencing is performed following bisulfite treatment of the extracted DNA. However, the high costs needed to achieve sufficient sequencing coverage make its application currently impractical for most large cohort studies.

This financial factor, coupled with studies showing that only ~20% of the methylome is variable across human tissues65, further illustrates the impracticality of the approach. Other sequencing approaches, such as reduced representation bisulfite sequencing (RRBS), have been developed and used in methylation studies to counter the limitations posed by WGBS. This technique employs a similar strategy to WGBS but includes a step that enriches for CpG-dense fragments of the genome recognized by the MspI restriction enzyme. Due to the affinity of this enzyme, RRBS shows similar biases as array-based methods for promoter regions, thereby preventing a full assessment of variable hypomethylated regions across the methylome.
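The bisulfite principle shared by the array and sequencing methods above can be sketched in a few lines of Python. This is a simplified in silico illustration, not the MCC-Seq pipeline or any published tool: it assumes forward-strand reads that are already aligned end-to-end to the reference, ignores sequencing error and incomplete conversion, and uses hypothetical function names and a toy sequence.

```python
def bisulfite_convert(seq, methylated_positions):
    """Sequence read out after bisulfite treatment and amplification.

    Unmethylated cytosines are read as T; cytosines at the given
    0-based positions are methylated, hence protected, and stay C.
    """
    out = []
    for i, base in enumerate(seq.upper()):
        if base == "C" and i not in methylated_positions:
            out.append("T")
        else:
            out.append(base)
    return "".join(out)


def methylation_level(reference, reads, cpg_pos):
    """Fraction of reads showing C (methylated) at a reference CpG cytosine."""
    if reference[cpg_pos:cpg_pos + 2].upper() != "CG":
        raise ValueError("cpg_pos does not point at the C of a CpG")
    calls = [read[cpg_pos] for read in reads if read[cpg_pos] in "CT"]
    if not calls:
        raise ValueError("no informative reads at this position")
    return calls.count("C") / len(calls)


# Toy reference with CpGs at positions 1 and 5 (0-based).
reference = "ACGTCCGA"
# Four hypothetical DNA molecules with different methylation states.
reads = [
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, {1, 5}),
    bisulfite_convert(reference, {1}),
    bisulfite_convert(reference, set()),
]
level_first_cpg = methylation_level(reference, reads, 1)
level_second_cpg = methylation_level(reference, reads, 5)
```

The per-CpG fraction computed here is the quantity whose reliability depends on read depth, which is why sequencing coverage drives the cost considerations discussed above.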

1.4.3 Epigenome-wide Association Studies of Common Traits

DNA methylation is known to be variable across tissues and individuals – with genetic72, environmental73 and stochastic effects underlying this variation. Given that complex traits are known to be modulated by a combination of genetic and environmental factors, it has been postulated that DNA methylation could be used as a proxy to study the impact of these factors on phenotypes and diseases. Changes in methylation patterns have been linked to developmental events and various disease states74,75. To date, most large-scale EWAS of complex traits have applied array-based methods that are biased to promoter regions in bioavailable tissues, such as whole

blood, as opposed to disease-linked tissues. A selection of prior impactful findings in the field of EWAS are summarized in this section.

The first large-scale epigenome-wide study of DNA methylation was released by Liu et al.76 in 2013, and focused on the complex immune system-related trait of rheumatoid arthritis (RA). In this work, an EWAS of ~350 whole blood case and control samples identified two regions within the major histocompatibility complex

(MHC) region76. At the time of publication, this cohort size was the largest release to date where adjustments for tissue heterogeneity were additionally considered.

However, replication efforts were conducted in only a small cohort of disease-linked monocyte-fractions extracted from whole blood samples of 12 cases and controls76.

Given the identification of clustered CpGs linked to the disease-state, the authors suggested that denser coverage than that offered by the 450K array would be beneficial for full assessment of CpGs linked to complex traits76.

Another notable EWAS was released in 201477 that focused on the common disease of T2D. Although comprising a smaller cohort of cases and controls, the study presented the first reference methylome for human pancreatic islets and identified ~1,700 disease-linked CpGs enriched in intergenic regions but depleted in CGIs77. Through their focus on a biologically-relevant tissue, the authors demonstrated the effectiveness of investigating disease-linked tissues in EWAS.

In 2014, Dick et al.78 presented an important study in the field of epigenomics with their genome-wide assessment of whole blood DNA methylation and its impact on obesity status. Their work represented the first large-scale effort of its kind, which encompassed multiple sequential replication cohorts (N>300/cohort) across tissues including the metabolically-linked adipose tissue. By linking CpG methylation status, profiled on the 450K array, to body mass index, an intronic region of the metabolically-linked79-81 HIF3A locus was associated with obesity across cohorts78.

The remainder of the EWAS section will focus on studies that investigate the association of DNA methylation with serum lipid traits such as HDL-C, LDL-C, TC and TG. Classically, these phenotypes have been studied in the context of cardiovascular risk. However, these traits are additionally linked to metabolic pathways and used as diagnostic measures in obesity. As such, these complex traits are a main focus of the presented doctoral work. Multiple large-scale EWAS of increasing cohort sizes have been sequentially released in which methylation and lipid phenotype associations have been interrogated: (1) Irvin et al.82 identified 2 lipid-linked CpGs using a discovery cohort (DC) of ~1,000 individuals and replication cohort (RC) of ~1,300 individuals; (2) Pfeiffer et al.83 reported 8 lipid-associated CpGs using a DC of ~1,800 samples and RC of ~500 individuals; (3) Dekkers et al.84 conducted a meta-analysis of 6 cohorts totaling ~3,300 individuals and identified 6 CpGs associated with lipid traits; (4) Sayols-Baixeras et al.85 reported 14 lipid-linked CpGs by interrogating a DC of ~700 individuals and RC of ~2,500 individuals; (5) Hedman et al.86 replicated lipid interactions at 33 CpGs across two meta-analyses of multiple cohorts totaling ~2,300 individuals in DC and ~2,200 individuals in RC.

In all these studies, CpG methylation was profiled using the 450K array in whole blood samples with some replication efforts in subcutaneous adipose tissue83,86.

Another commonality between these studies is the reporting of lipid-associated CpGs that link to genes involved in lipid metabolism, including CPT1A82,84,85, ABCG183-86 and SREBF183-85. Of note, an investigation of lipid-linked CpGs in metabolically-linked tissues such as visceral adipose tissue or liver remains to be undertaken to assess their full impact in metabolic diseases. These studies further bring into question the issue of causality of DNA methylation in disease mechanism.

Although the cross-sectional nature of most released studies does not allow for this type of assessment, Dekkers et al.84 presented evidence that Mendelian randomization can be used to infer causality and that blood lipid levels can have an effect on DNA methylation. However, this study is limited in biological interpretation by its use of whole blood, and no studies have so far assessed this causal relationship in biologically-relevant tissues of metabolic disease. Much debate concerning the topic of causality still exists in the field.

1.5 Rationale, Hypothesis and Objectives

Full epigenome maps of metabolic disease-linked adipose tissue, especially in terms of CpG methylation and chromatin accessibility, are still lacking27. Studies have demonstrated that disease-associated variants and CpGs are enriched in active regulatory elements. None of the available DNA methylation techniques are optimal for comprehensive studies of methylation variation and their impact on complex traits. To overcome the limitations posed by these profiling techniques, we proposed that selective interrogation of functional methylomes in disease-linked tissues by next-generation sequencing (NGS) would allow identification and in-depth profiling of biologically-relevant variants associated with complex traits in a high-powered and cost-effective manner. We further postulated that integrating these single-base resolution methylomes with other layers of ‘omics’ data would be informative to further decipher complex trait etiology.

As such, the objective of the presented PhD work was to contribute to the understanding of the genetic and epigenetic basis of complex traits focusing on metabolic disease-related phenotypes. In the first study, we aimed to implement a protocol for a novel NGS-capture method (MethylC-Capture Sequencing; MCC-Seq) that allows for simultaneous targeted interrogation of genotypes and tissue-specific functional methylomes in disease-relevant tissues. Using adipose tissue as a model, we designed custom capture panels reflecting the functional methylome of this metabolic disease-linked tissue. Through this study, we aimed to (1) provide a technical validation of the approach through comparisons with other available methylation and genotyping profiling methods and (2) illustrate the potential of the method as a feasible alternative for EWAS by presenting a pilot study of visceral adipose tissue from 72 obese individuals undergoing bariatric surgery87 (IUCPQ) where we linked triglycerides to methylation status.

In the second study, we applied the MCC-Seq method to an extended cohort of ~200 matched visceral adipose and whole blood tissues from the same clinically relevant obese population (IUCPQ). We aimed to provide an extended catalog of lipid-associated epigenetic variants located at adipose tissue regulatory elements. We further sought to characterize the identified lipid-associated epi-variants through investigation of (1) localization patterns in regulatory elements, (2) tissue-specific nature, (3) genetic and non-genetic regulation contributions, and (4) integrational approaches with additional layers of ‘omics’ data from complementary studies.

Chapter 2: Characterization of Methylomes by Next-Generation Capture Sequencing

2.1 Bridging Statement between Chapter 1 and 2

Methylation profiling for large-scale EWAS of complex traits has commonly employed array-based technologies (e.g. 450K array) in bioavailable tissues such as whole blood. Studies have reported that variants linked to common diseases are enriched in open chromatin of biologically-relevant tissues. However, as the 450K array is biased to promoter regions, it does not allow for full investigation of functional methylomes. To overcome this limitation, we partnered with Roche NimbleGen to implement an approach that permits single-base resolution profiling of CpGs over targeted regions. We dubbed this method methylC-capture sequencing (MCC-Seq). We formed a close collaboration with the IUCPQ in Québec City (Canada), which has collected tissue samples from deeply-phenotyped obese individuals undergoing weight management surgical procedures. This relationship provided access to visceral adipose tissue (VAT) samples, enabling us to generate the first epigenome-wide maps of methylation, gene expression and chromatin accessibility (i.e. ATAC-Seq) for this tissue type. Linking triglyceride (TG) levels to methylation status across a pilot cohort of 72 VAT samples, we generated the first EWAS investigating all regulatory regions of a tissue of interest at high coverage (30X). Finally, we further collaborated with the MuTHER consortium in the United Kingdom for replication of our findings and, thus, validation of our presented method.

In summary, here we present our proof of concept study of MCC-Seq for application in epigenome-wide studies.

2.2 Title, Authors and Affiliations

Characterization of Functional Methylomes by Next-Generation Capture Sequencing Identifies Novel Disease Associated Variants

Fiona Allum1,2,*, Xiaojian Shao1,2,*, Frédéric Guénard3, Marie-Michelle Simon1,2, Stephan Busche1,2, Maxime Caron1,2, John Lambourne1,2, Julie Lessard4, Karolina Tandre5, Åsa K Hedman6,7, Tony Kwan1,2, Bing Ge1,2, The Multiple Tissue Human Expression Resource Consortium**, Lars Rönnblom5, Mark I McCarthy8,9,10, Panos Deloukas11,12, Todd Richmond13, Daniel Burgess13, Timothy D Spector14, André Tchernof4, Simon Marceau4, Mark Lathrop1,2, Marie-Claude Vohl3, Tomi Pastinen1,2, Elin Grundberg1,2

1Department of Human Genetics, McGill University, 740 Docteur-Penfield Avenue, Montreal, QC H3A 0G1, Canada

2McGill University and Genome Quebec Innovation Centre, 740 Docteur-Penfield Avenue, Montreal, QC H3A 0G1, Canada

3Institute of Nutrition and Functional Foods (INAF), Université Laval, 2440 Hochelaga Boulevard, Québec, QC G1V 0A6, Canada

4Québec Heart and Lung Institute, Université Laval, 2725 Sainte-Foy Road, Québec, QC G1V 4G5, Canada

5Department of Medical Sciences, Uppsala University, Akademiska sjukhuset Ing. 40, Uppsala 75185, Sweden

6Department of Medical Sciences, Molecular Epidemiology, Dag Hammarskjölds väg 14B, Uppsala University, Uppsala 75185, Sweden

7Science for Life Laboratory, Dag Hammarskjölds väg 14B, Uppsala University, Uppsala 75185, Sweden

8Wellcome Trust Centre for Human Genetics, University of Oxford, Roosevelt Dr, Oxford OX3 7BN, UK

9Oxford Centre for Diabetes, Endocrinology & Metabolism, University of Oxford, Churchill Hospital, Headington, Oxford OX3 7JU, UK

10Oxford National Institute for Health Research Biomedical Research Centre, Churchill Hospital, Headington, Oxford OX3 7JU, UK

11Wellcome Trust Sanger Institute, Wellcome Trust Genome Campus, Hinxton, Cambridge CB10 1SA, UK

12William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London EC1M 6BQ, UK

13Roche NimbleGen, 500 South Rosa Road, Madison, WI 53719, US

14Department of Twin Research and Genetic Epidemiology, King's College London, St Thomas’ Campus, Lambeth Palace Road, London SE1 7EH, UK

*These authors contributed equally to this work

**Full list of members and affiliations appears at the end of this paper

Correspondence should be addressed to E.G. ([email protected])

Published in: Nature Communications 2015 May; 6: 7211. doi: 10.1038/ncomms8211.

2.3 Abstract

Most genome-wide methylation studies (EWAS) of multifactorial disease traits use targeted arrays or enrichment methodologies preferentially covering CpG-dense regions, to characterize sufficiently large samples. To overcome this limitation, we present here a new customizable, cost-effective approach, methylC-capture sequencing (MCC-Seq), for sequencing functional methylomes, while simultaneously providing genetic variation information. To illustrate MCC-Seq, we use whole-genome bisulfite sequencing on adipose tissue (AT) samples and public databases to design AT-specific panels. We establish its efficiency for high-density interrogation of methylome variability by systematic comparisons with other approaches and demonstrate its applicability by identifying novel methylation variation within enhancers strongly correlated to plasma triglyceride and HDL-cholesterol, including at CD36. Our more comprehensive AT panel assesses tissue methylation and genotypes in parallel at 4 and 3 M sites, respectively. Our study demonstrates that MCC-Seq provides comparable accuracy to alternative approaches but enables more efficient cataloguing of functional and disease-relevant epigenetic and genetic variants for large-scale EWAS.

2.4 Introduction

DNA methylation is an epigenetic modification that was previously thought to be important only for gene silencing through hypermethylation of CpG islands in promoter regions; however, recent studies have revealed more diverse functions dependent on genomic location67. For instance, hypermethylation within the gene bodies is likely to be indicative of primed expression and is associated with increased gene expression25,66. Profiling of histone modifications by chromatin immunoprecipitation and high-throughput sequencing (ChIP-Seq) has uncovered strong correlations between chromatin structure and DNA methylation, with hypomethylated regions associated to active marks or open chromatin and hypermethylated regions suggestive of repressed regulatory regions and heterochromatin25. H3K4me3, known to mark nucleosomes flanking transcription start sites and CpG-rich promoters, is negatively associated with DNA methylation, whereas distal regulatory elements (that is, enhancers) marked by H3K4me1 are relatively CpG poor with a more variable hypo- to hemimethylated profile66. As the majority of approaches assessing the human methylation landscape have been biased to CpG-rich regions66, the methylation pattern of enhancers remains to be described in more detail.

Other investigated features of DNA methylation variation in human populations include the effects of environmental88 and genetic factors66,89, and the role of methylation in complex disease susceptibility76-78. We recently estimated methylation levels of 450,000 CpGs in subcutaneous adipose tissue (AT) across 648 female twins and identified common sequence variants that contribute significantly to methylation variability, with indications that some of these variants may also mediate genetic risk for metabolic diseases66, possibly through changes in methylation levels. We further noted that these variable, disease-linked methylation sites are enriched in distal regulatory elements, paralleling earlier findings of common sequence variants identified in genome-wide association studies (GWAS) being enriched in active chromatin measured by DNaseI hypersensitivity28 or within tissue-specific enhancer marks90. In addition, the most comprehensive study to date of methylation profiles across multiple tissues also highlighted that enhancers contain tissue-specific variable CpGs that co-localize with tissue-specific transcription factors65.

However, the majority of methylation quantitative trait loci (QTL) and epigenome-wide association studies (EWAS) presented to date have used the Illumina Human Methylation 450 BeadChip array (Illumina 450K array). Although covering 480,000 CpGs in the human genome, the Illumina 450K array is biased towards regions with high CpG content such as promoters, which we and others have demonstrated to have limited inter-individual and inter-tissue variation66,65. Largely due to the invariable state of promoter-located CpGs, these regions are also known to be depleted among significant disease-associated sites66. Importantly, tissue-specific and disease-relevant regions such as enhancers are greatly underrepresented on the Illumina 450K array.

In contrast to available targeted methodologies91 or alternative sequencing methods biased towards CpG-dense regions such as reduced representation bisulfite sequencing (RRBS) or methylated DNA immunoprecipitation, whole-genome bisulfite sequencing (WGBS) allows complete characterization of the methylation landscape. However, with only 20% or less of CpGs being variable across individuals or tissues65, WGBS is inefficient for large-scale population studies, as it has a high cost and requires in-depth sequencing capacity to achieve sufficient coverage. Thus, none of the above methods are optimal for comprehensive studies of methylation variation and their impact on complex diseases. Alternative approaches to interrogate functional (that is, regulatory active) methylomes are needed for more comprehensive yet cost-effective identification of biologically relevant CpGs associated with complex diseases.

Here, we present methylC-capture sequencing (MCC-Seq), a next-generation sequencing capture approach interrogating functional methylomes in disease-targeted tissues or cells. We design AT-specific panels to probe up to 4.5 × 10⁶ putative functional DNA methylation sites as defined by their localization to hypomethylated footprints and regulatory elements, as well as 2.8 × 10⁶ single-nucleotide polymorphisms (SNPs) for simultaneous genotyping profiling. We validate the method through comparisons with WGBS, Illumina 450K array and Agilent SureSelect Human Methyl-Seq (Agilent SureSelect) data, and show that MCC-Seq yields comparable resolution over targeted intervals. We demonstrate the ability of MCC-Seq to identify novel biologically relevant epigenetic variants associated with disease by applying the panel in a disease–trait association study of 72 individuals. Our initial results illustrate the advantages of this new approach over currently used methods for methylome analysis, providing a viable alternative for powerful and cost-effective large-scale interrogation of functional methylomes.

2.5 Results

2.5.1 First-generation Capture Panel Design for MCC-Seq

Using human AT as a model, we designed a first-generation sequence panel to capture the putative functional and disease-linked methylome in AT (Met V1) (Table 1 and Methods). We targeted 87 Mb of sequence comprising (1) hypomethylated footprints, generated from WGBS data of 30 AT samples, (2) regulatory elements (identified by H3K4me1 and H3K4me3 ChIP-Seq) in human adipocytes characterized by the NIH Roadmap Epigenomics Mapping Consortium and (3) 50 K CpGs with known association to metabolic phenotypes66 (Supplementary Data 1). Altogether, the panel targets 2,496,975 unique CpGs with 210,883 directly overlapping Illumina 450K array-targeted CpGs. By including both putative enhancer and promoter regions, we aimed to obtain a more complete profile of AT-specific regulatory regions and to investigate the variability status of these CpGs at increased depth over previous studies66.
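When target regions are pooled from several sources, as above, overlapping intervals have to be collapsed before the total targeted sequence can be counted. The following is a minimal sketch of that bookkeeping only, not the design procedure described in Methods. It assumes BED-style half-open coordinates; the function names and toy intervals are hypothetical.

```python
def merge_intervals(intervals):
    """Collapse overlapping or abutting (chrom, start, end) intervals.

    Coordinates are assumed to be half-open, as in BED files.
    """
    merged = []
    for chrom, start, end in sorted(intervals):
        if merged and merged[-1][0] == chrom and start <= merged[-1][2]:
            # Overlaps (or touches) the previous interval: extend it.
            merged[-1][2] = max(merged[-1][2], end)
        else:
            merged.append([chrom, start, end])
    return [tuple(m) for m in merged]


def total_bp(intervals):
    """Total number of bases covered by non-overlapping intervals."""
    return sum(end - start for _, start, end in intervals)


# Hypothetical intervals from three sources (footprints, histone peaks, CpG loci)
targets = [("chr1", 100, 200), ("chr1", 150, 300), ("chr1", 300, 400), ("chr2", 50, 60)]
panel = merge_intervals(targets)
panel_size = total_bp(panel)
```

Summing interval lengths without merging would count shared bases twice, which matters when footprints and histone peaks largely coincide.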

In MCC-Seq, a whole-genome sequencing library is prepared, bisulfite converted and amplified, followed by a capture step that enriches for targeted bisulfite-converted DNA fragments (Methods). This is achieved through the novel SeqCap Epi probe design platform by Roche NimbleGen, which enables capture of double-stranded targets regardless of their methylated state via high tiling density of probes. To test the efficiency and performance of Met V1, we performed targeted enrichment of both uniplex (1-plex) and multiplexed library samples (2-plex, 4-plex, 6-plex and 10-plex). Each capture was sequenced on a single lane of the 100 bp paired-end Illumina HiSeq2000/2500 System. Generated reads were aligned to the converted reference genome using BWA v.0.6.192 and filtered according to our benchmark bioinformatics workflow (Methods) using a read depth cutoff per CpG of ≥5X.

The sequence statistics obtained for the different captured pooled samples are summarized in Supplementary Table 1. The average on-target CpG read depth ranged from 13X (10-plex) to 82X (1-plex) and the percentage of total reads that mapped within the target CpGs averaged 62% (ranging from 51% to 80%), and was independent of the degree of multiplexing. The average number of targeted CpGs with ≥5X depth of sequence coverage decreased with increasing multiplexing from 94% (1-plex) to 63% (10-plex) of targeted CpGs (Supplementary Table 1).
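The three capture statistics quoted above (on-target read fraction, mean CpG depth and the share of targeted CpGs reaching ≥5X) can be derived from per-CpG read depths. The sketch below is illustrative only: the function name, its inputs and the toy numbers are assumptions, not the workflow described in Methods.

```python
def coverage_summary(cpg_depths, on_target_reads, total_reads, min_depth=5):
    """Summarize one capture from per-CpG read depths.

    cpg_depths: read depth at every targeted CpG (0 if not covered).
    """
    n_cpgs = len(cpg_depths)
    covered = sum(1 for depth in cpg_depths if depth >= min_depth)
    return {
        "on_target_fraction": on_target_reads / total_reads,
        "mean_cpg_depth": sum(cpg_depths) / n_cpgs,
        "fraction_cpgs_covered": covered / n_cpgs,
    }


# Hypothetical depths at five targeted CpGs
summary = coverage_summary([0, 4, 5, 12, 29], on_target_reads=620, total_reads=1000)
```

Splitting a fixed number of reads across more samples lowers the per-sample depth at each CpG, so fewer CpGs clear the ≥5X cutoff, while the on-target fraction depends on the capture itself rather than on the number of pooled samples.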

2.5.2 Second-generation Panel Design for Comprehensive Profiling

Based on the performance of the first AT-specific panel, we developed and assessed a comprehensive second-generation AT MCC-Seq panel (Met V2) that encompasses additional AT-specific regulatory regions and variants, and additional SNPs throughout the genome for simultaneous methylation and genetic association studies (Table 1 and Methods). The Met V2 panel targets 156 Mb of sequence spanning 4,442,383 unique CpGs and 2,840,815 autosomal biallelic SNPs from dbSNP 137. The regions covered by the Met V2 panel include the following: (1) CpGs contained within low (LMRs) and unmethylated regions (UMRs) identified from merged data sets of 30 WGBS AT samples (Supplementary Data 2); (2) CpGs located within human adipocyte regulatory elements (H3K4me1 and H3K4me3) from the NIH Roadmap Epigenomics Mapping Consortium; (3) all 482,421 CpGs from the Illumina 450K array; (4) 28,947 regions covering metabolic disease-associated GWAS loci from the National Human Genome Research Institute GWAS catalogue (9 January 2014); and (5) the 256,327 SNPs from the Illumina HumanCore BeadChip.

To assess the performance of the larger Met V2 panel, we applied the MCC-Seq protocol and performed targeted enrichment on a 6-plex capture. Sequencing was conducted on one lane of the 100-bp paired-end Illumina HiSeq2000/2500 System. On average, 62% of the reads mapped to target regions with 15X mean coverage and 65% of the target regions were covered at a sequence depth of ≥5X (Supplementary Table 2).

2.5.3 Sample-based Validation of MCC-Seq

As a first validation step, we assessed the effects of technical variability on methylation profiles by comparing the results obtained from a single DNA sample derived from visceral AT (VAT) prepared in replicate experiments with independent captures on the same panel (Met V1) and different degrees of multiplexing (4-plex versus 10-plex). We found a high concordance of the methylation calls for overlapping CpGs (N=1,587,026; average coverage (4-plex)=36X; average coverage (10-plex)=30X; R=0.98; Pearson's correlation is used throughout; Fig. 1A). We also assessed technical variability of the methylation calls by comparing the results from a different DNA sample prepared in replicate experiments with independent captures using the two different panels and different degrees of multiplexing (Met V1, 4-plex versus Met V2, 6-plex), and confirmed the high concordance in methylation calls for CpGs (N=1,569,170; average coverage (Met V1, 4-plex)=16X; average coverage (Met V2, 6-plex)=30X; R=0.97; Fig. 1B).
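The replicate comparison amounts to computing a methylation level per CpG from read counts, keeping CpGs that pass the depth cutoff in both experiments, and correlating the two sets of calls. The following is a minimal sketch under those assumptions; the data layout (a dictionary of methylated and total read counts keyed by position), the function names and the toy values are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def replicate_concordance(rep_a, rep_b, min_depth=5):
    """Correlate methylation calls at CpGs covered in both replicates.

    rep_a, rep_b: dict mapping (chrom, pos) -> (methylated_reads, total_reads).
    Returns the number of overlapping CpGs and Pearson's R (percent methylation).
    """
    shared = [
        cpg for cpg in rep_a
        if cpg in rep_b
        and rep_a[cpg][1] >= min_depth
        and rep_b[cpg][1] >= min_depth
    ]
    level_a = [100.0 * rep_a[c][0] / rep_a[c][1] for c in shared]
    level_b = [100.0 * rep_b[c][0] / rep_b[c][1] for c in shared]
    return len(shared), pearson_r(level_a, level_b)


# Hypothetical counts; the CpG at position 400 fails the depth cutoff in rep_a
rep_a = {("chr1", 100): (0, 20), ("chr1", 200): (10, 20),
         ("chr1", 300): (20, 20), ("chr1", 400): (1, 3)}
rep_b = {("chr1", 100): (1, 10), ("chr1", 200): (5, 10),
         ("chr1", 300): (9, 10), ("chr1", 400): (5, 10), ("chr1", 500): (2, 10)}
n_shared, r = replicate_concordance(rep_a, rep_b)
```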

Next, we compared MCC-Seq methylation with WGBS data from the same sample. Here we also found a high correlation (MCC-Seq versus WGBS) for overlapping CpGs (N=1,620,874; average coverage (MCC-Seq)=31X; average coverage (WGBS)=23X; R=0.97; Fig. 1C). In addition, we evaluated the sequence results against the Illumina 450K array data on the subset of these CpGs that are included in the array by comparing this method against both MCC-Seq and WGBS. For both comparisons, we obtained correlations of R=0.96 (N=150,898; average coverage (MCC-Seq)=32X; average coverage (WGBS)=23X; Fig. 2 and Supplementary Table 3). To rule out any biases in the comparisons, we also restricted the correlations to CpGs with intermediate methylation levels by excluding completely hypo- (0%) and hypermethylated (100%) CpGs based on the WGBS and MCC-Seq data. Encouragingly, we found the high correlation being maintained with R=0.95 (N=45,097; average coverage (MCC-Seq)=33X) and R=0.94 (N=45,097; average coverage (WGBS)=25X) for MCC-Seq versus Illumina 450K and WGBS versus Illumina 450K, respectively (Supplementary Fig. 1). Using this limited set of CpGs profiled across multiple approaches, we were also able to confirm the importance of generating sufficient sequence depth for accurate methylation calls, as correlation was shown to improve with increased read-depth cutoffs (Supplementary Table 4). Similar improvement in correlations of methylation calls by MCC-Seq and Illumina 450K was seen with increasing read depth (Supplementary Fig. 2).

Finally, we contrasted methylation calls from MCC-Seq against Agilent SureSelect, another targeted-sequencing approach based on a different methylation capture strategy than described here, allowing only single-strand capture of smaller target regions, and thus not suitable for comprehensive genotype profiling. More specifically, MCC-Seq relies on the efficient capture of targeted methylated and unmethylated CpGs (up to 160 Mb or 4.4 M CpGs) in bisulfite-converted libraries, whereas Agilent SureSelect captures target regions before bisulfite conversion and requires larger amounts of input DNA. By juxtaposing both capture approaches using the same sample sequenced at extreme depth, we obtained correlations that mimic those of our technical replicates shown above (N=2,551,186; average coverage (SureSelect)=137X; average coverage (MCC-Seq)=216X; R=0.99; Supplementary Fig. 3A). This high correlation (N=1,734,371; average coverage (SureSelect)=156X; average coverage (MCC-Seq)=230X; R=0.99; Supplementary Fig. 3B) was also seen when excluding completely hypo- (0%) and hypermethylated (100%) CpGs in both approaches, indicating accuracy in measurement for CpGs with intermediate methylation levels as shown above.

2.5.4 Population-based Validation of MCC-Seq

We then applied MCC-Seq Met V1 to a set of 72 VAT samples derived from obese individuals (body mass index (BMI) >40 kg m⁻²) aged 19–67 years, undergoing bariatric surgery and diagnosed with or without metabolic syndrome87 (Methods). Metabolic syndrome was diagnosed when individuals had abdominal obesity and at least two of the following four criteria set by the National Cholesterol Education Program Adult Treatment panel III11: elevated plasma fasting glucose, high triglyceride (TG) levels, high blood pressure or lower high-density lipoprotein cholesterol (HDL-C) levels. Using the 4-plex pooling approach, we sequenced the samples to an average read depth of 25X for on-target CpGs. At a sequence depth of ≥5X, a total of 2,147,576 CpGs were detected in at least one individual, with 1,882,222 (88%) CpGs detected in at least 50% of the samples. In all subsequent population-based analyses, we required ≥5X coverage based on our comparisons described above (Supplementary Fig. 2). In addition to requiring ≥5X coverage, we eliminated CpGs that had low coverage, by removing those that were below the 20th percentile for averaged coverage over the 72 samples for the distribution across all CpGs. This yielded 1,710,209 CpGs for further consideration (Supplementary Fig. 4 and Methods) with an average sequence depth of 30X and a minimum of 18X. An outline of all population-based analyses is shown in Supplementary Fig. 5.
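The low-coverage filter described above can be sketched as follows. This is an illustration under stated assumptions rather than the procedure in Methods: the percentile is taken with the default interpolation of Python's `statistics.quantiles`, CpGs never reaching ≥5X in any sample are dropped first, and the function name and toy depths are hypothetical.

```python
from statistics import mean, quantiles


def filter_low_coverage(depths, min_depth=5, n_quantiles=5):
    """Drop CpGs whose average depth falls below the 20th percentile.

    depths: dict mapping a CpG identifier to its per-sample read depths.
    Returns the set of retained CpGs and the depth cutoff that was applied.
    """
    # Keep only CpGs detected at >= min_depth in at least one sample.
    detected = {cpg: d for cpg, d in depths.items()
                if any(x >= min_depth for x in d)}
    avg_depth = {cpg: mean(d) for cpg, d in detected.items()}
    # With n_quantiles=5 the first cut point is the 20th percentile.
    cutoff = quantiles(avg_depth.values(), n=n_quantiles)[0]
    kept = {cpg for cpg, avg in avg_depth.items() if avg >= cutoff}
    return kept, cutoff


# Hypothetical depths: ten CpGs with average depth 6, 8, ..., 24 in two samples,
# plus one CpG that never reaches 5X.
depths = {f"cpg{i}": [6 + 2 * i, 6 + 2 * i] for i in range(10)}
depths["never_detected"] = [2, 3]
kept, cutoff = filter_low_coverage(depths)
```

In this toy example the two CpGs with the lowest average depth are removed, which mirrors the roughly 20% reduction from 2,147,576 to 1,710,209 CpGs reported above.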

First, we characterized the methylation pattern of these 1.7 M CpGs assessed within 72 AT samples and noted that, as expected, the majority (69%) of the captured CpGs exhibited a hypomethylated pattern (defined as <20% methylation) with only 17% being hemi- to hypermethylated (defined as >50% methylation; Supplementary Fig. 6). We also characterized these CpGs by assessing their genomic localization within putative regulatory regions through their overlap with histone marks (H3K4me1 and H3K4me3) in human adipocytes and hypomethylated footprints from our WGBS on 30 AT samples (Methods). To do this, we first characterized hypomethylated footprints by distinguishing between LMRs and UMRs in the WGBS data as previously described61 (Methods and Supplementary Data 2). We noted that LMRs were associated with CpG-poor distal regulatory regions (average methylation level of 24%), whereas UMRs were CpG-dense and mapped principally to promoter regions (average methylation level of 9%; Supplementary Fig. 7). For the regulatory elements overlapping H3K4me3 marks (active promoters), we restricted our analysis to locations within 1 kb of transcription start sites of known RefSeq transcripts and not overlapping H3K4me1 marks as previously described66. We then assessed the population variability of methylation levels for CpGs mapping to H3K4me1 marks or LMRs (putative enhancers) and compared this with similar estimates of methylation variation for CpGs mapping to H3K4me3 marks or UMRs (putative promoter regions). As previously reported66, methylation of CpGs that map to enhancer elements is more variable across individuals (median s.d.=9.4), whereas promoter regions display a more invariable pattern (median s.d.=1.5; Supplementary Fig. 8).
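The variability comparison reduces to a per-CpG standard deviation across individuals, summarized by its median within each class of regulatory element. The sketch below assumes percent methylation values and a precomputed class label per CpG; the names and toy data are hypothetical, and the sample standard deviation is an assumption because the source does not state which estimator was used.

```python
from statistics import median, stdev


def median_sd_by_class(methylation, element_class):
    """Median across CpGs of the inter-individual s.d., per element class.

    methylation: dict CpG -> percent methylation in each individual.
    element_class: dict CpG -> label such as "enhancer" or "promoter".
    """
    sds = {}
    for cpg, values in methylation.items():
        sds.setdefault(element_class[cpg], []).append(stdev(values))
    return {label: median(values) for label, values in sds.items()}


# Hypothetical methylation levels (%) in three individuals
methylation = {"e1": [10, 20, 30], "e2": [35, 50, 65],
               "p1": [1, 2, 3], "p2": [2, 2, 2]}
element_class = {"e1": "enhancer", "e2": "enhancer",
                 "p1": "promoter", "p2": "promoter"}
variability = median_sd_by_class(methylation, element_class)
```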

We then profiled a subset (N=24) of the 72 VAT samples (Supplementary Fig. 5) with the Illumina 450K array, for direct comparisons of methylation scores estimated by the two methods when considering multiple samples. We applied a normalization method on the Illumina 450K array data to reduce technical biases that have been shown to have an impact on the β-values93 (Methods). The average correlation of methylation levels estimated by the two methods was R=0.50 and R=0.58, respectively, for the top 25% (N=34,517; median s.d.=11.0) and top 10% (N=13,807; median s.d.=13.6) most variable CpGs in the MCC-Seq data based on s.d. estimates of each CpG (Supplementary Fig. 9). These population-based correlations of MCC-Seq versus the Illumina 450K array are noticeably lower than the sample-based correlations described above; however, given the different nature of the comparisons, that is, correlation of the methylation measurements at each CpG in multiple individuals here versus the overall correlation across all CpGs within a single sample, they cannot be directly compared. Indeed, we find that the sample-based correlations across the 24 samples are similar to those described above for a single sample, ranging from R=0.93 to 0.96.
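The distinction between the two kinds of correlation can be made concrete with a toy matrix of samples by CpGs. Within one sample, methylation spans almost the full 0 to 100% range across CpGs, so small measurement differences barely affect R; at a single CpG, the spread between individuals is narrow and the same differences weigh much more heavily. The values below are invented for illustration and the function names are hypothetical.

```python
from math import sqrt


def pearson_r(x, y):
    """Pearson's correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def per_sample_r(seq, arr):
    """One correlation per sample, computed across all CpGs (rows)."""
    return [pearson_r(s, a) for s, a in zip(seq, arr)]


def per_cpg_r(seq, arr):
    """One correlation per CpG, computed across all samples (columns)."""
    return [pearson_r(s, a) for s, a in zip(zip(*seq), zip(*arr))]


# Rows are four individuals, columns are three CpGs (% methylation, toy values)
seq = [[4, 49, 94], [3, 48, 93], [7, 52, 97], [6, 51, 96]]
arr = [[2, 47, 92], [5, 50, 95], [5, 50, 95], [8, 53, 98]]
sample_level = per_sample_r(seq, arr)
cpg_level = per_cpg_r(seq, arr)
```

Here every sample-level correlation is close to 1, while each CpG-level correlation is about 0.45, even though both are computed from the same two tables.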

2.5.5 Population-based Genotype Profiling by MCC-Seq

The same 24 AT samples described above were also genotyped with the Illumina HumanOmni2.5S-8 BeadChip array for validation of MCC-Seq's ability to simultaneously call genotypes. After stringent quality control, we obtained SNP genotypes at 94,600 overlapping loci using MCC-Seq (Met V1) (Methods). We observed 99% genotype concordance between the two methods at sites on the SNP array, indicating that MCC-Seq has the potential to allow for simultaneous and accurate genotype calling over regions of interest. Similarly, comparing the observed heterozygosity from the two measurements yielded high correlation (Supplementary Fig. 10).
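Genotype concordance of this kind is the fraction of sample and SNP pairs, called by both platforms, at which the two genotypes agree. A minimal sketch follows, assuming unphased genotypes written as two-letter allele strings; the identifiers and calls are placeholders, not data from the study.

```python
def genotype_concordance(calls_a, calls_b):
    """Fraction of shared (sample, SNP) calls with identical genotypes.

    Genotypes are unphased allele pairs such as "AG"; allele order is ignored.
    Returns the number of shared calls and the concordance rate.
    """
    shared = calls_a.keys() & calls_b.keys()
    matches = sum(1 for key in shared
                  if sorted(calls_a[key]) == sorted(calls_b[key]))
    return len(shared), matches / len(shared)


# Hypothetical calls from sequencing and from a SNP array
seq_calls = {("s1", "snp1"): "AG", ("s1", "snp2"): "CC",
             ("s2", "snp1"): "GA", ("s2", "snp2"): "CT"}
array_calls = {("s1", "snp1"): "GA", ("s1", "snp2"): "CC",
               ("s2", "snp1"): "AA", ("s2", "snp2"): "TC",
               ("s2", "snp3"): "GG"}
n_calls, concordance = genotype_concordance(seq_calls, array_calls)
```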

In total and based on dbSNP 137, we determined that the Met V1 panel has the potential to detect 1,343,928 autosomal biallelic SNPs within its target regions, of which an average of 1,300,369 (97%) per sample were covered at a read depth of ≥5X. In the broader Met V2 panel, there is a heightened potential for autosomal biallelic SNP detection (2,840,815) with an average of 2,666,458 (94%) SNPs detected per sample at ≥5X read coverage. Thus, the performance of the Met V2 panel is similar to that of the V1 panel, despite its more extensive coverage (for example, 156 versus 87 Mb).

2.5.6 EWAS of TG Levels using MCC-Seq

To illustrate the application of MCC-Seq for epigenome mapping of a quantitative trait, we examined plasma TG levels measured in the 72 individuals for whom the MCC-Seq Met V1 data were available. We note that TG exhibits substantial individual variability in the study cohort (Supplementary Fig. 11). To assess associations, we applied a generalized linear model (GLM) assuming a binomial distribution of methylation levels and adjusting for BMI, age and biological sex along with the sequence depth at each CpG. We assigned a nominal significance for the trait association using a permutation test (Methods). We identified 2,580 CpGs with P-value ≤0.001 (Supplementary Data 3) and 518 CpGs with P-value ≤0.0001.
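The logic of a permutation test can be sketched as follows. To keep the example self-contained, the binomial GLM is replaced by a much simpler stand-in statistic, a depth-weighted Pearson correlation between methylation level and TG; this substitution, the `(hits + 1) / (n_perm + 1)` convention, the function names and the toy data are all assumptions made for illustration and are not taken from the study.

```python
import random
from math import sqrt


def weighted_corr(x, y, w):
    """Pearson correlation of x and y with per-observation weights w."""
    sw = sum(w)
    mx = sum(wi * xi for wi, xi in zip(w, x)) / sw
    my = sum(wi * yi for wi, yi in zip(w, y)) / sw
    cov = sum(wi * (xi - mx) * (yi - my) for wi, xi, yi in zip(w, x, y))
    vx = sum(wi * (xi - mx) ** 2 for wi, xi in zip(w, x))
    vy = sum(wi * (yi - my) ** 2 for wi, yi in zip(w, y))
    return cov / sqrt(vx * vy)


def permutation_p(meth, depth, trait, n_perm=10000, seed=0):
    """Permutation P-value for the methylation-trait association.

    Methylation values (with their read depths) are shuffled across
    individuals while the trait stays fixed. The P-value is the share of
    permutations whose statistic is at least as extreme as the observed
    one, with +1 added to numerator and denominator.
    """
    rng = random.Random(seed)
    observed = abs(weighted_corr(meth, trait, depth))
    index = list(range(len(meth)))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(index)
        perm_meth = [meth[i] for i in index]
        perm_depth = [depth[i] for i in index]
        if abs(weighted_corr(perm_meth, trait, perm_depth)) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data for one CpG in ten individuals (methylation fraction, read
# depth, TG level); values are invented.
meth = [0.10, 0.15, 0.22, 0.30, 0.35, 0.41, 0.50, 0.58, 0.66, 0.72]
depth = [12, 30, 25, 18, 40, 22, 35, 28, 16, 20]
tg = [0.8, 1.0, 1.1, 1.5, 1.6, 1.9, 2.2, 2.4, 2.9, 3.1]

p_value = permutation_p(meth, depth, tg, n_perm=2000)
```

With a real binomial GLM, the statistic inside the loop would be the model's test statistic (or P-value) for the trait term, refitted at each permutation round.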

The locations of these potential TG-associated CpGs were evaluated with respect to putative regulatory regions through their overlap with histone marks (H3K4me1 and H3K4me3) in human adipocytes, and LMRs and UMRs identified as described above (Methods). As shown in Fig. 3A, TG-associated CpGs (P≤0.001) were found to map preferentially to H3K4me1 (enhancer) histone marks and/or LMRs (Fisher's exact test P=5.3 × 10⁻⁷). This pattern was even more pronounced when information on LMRs unique to AT and H3K4me1 peaks was combined (Methods) to demarcate putative enhancers (Fisher's exact test P=6.0 × 10⁻¹⁰). This supports the mounting evidence that disease–trait-associated epigenetic variants localize, to a large extent, to distal regulatory regions. Similar results were also observed when restricting the analysis to CpGs that met the more stringent criterion of P≤0.0001 in the permutation test (Fig. 3A). Furthermore, at both P-value cutoffs, we observed depletion of TG-associated CpGs within putative promoter regions that are shared across tissues as detected by either H3K4me3 histone marks or UMRs (Fisher's exact test P=7.1 × 10⁻¹⁰) versus enrichment when restricting to promoter marks unique to AT (Fisher's exact test P=2.4 × 10⁻³; Fig. 3B).
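For readers unfamiliar with the enrichment test used here, the following sketch implements a two-sided Fisher's exact test on a 2×2 table from the hypergeometric distribution. The table layout (trait-associated versus background CpGs, inside versus outside an annotation) and the counts are hypothetical and are not the counts behind the P-values reported above.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]].

    Sums the hypergeometric probabilities of all tables with the same
    margins that are no more likely than the observed table.
    """
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    total = sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-7)  # tolerance for float ties
    )
    return min(total, 1.0)


def odds_ratio(a, b, c, d):
    """Sample odds ratio of the 2x2 table [[a, b], [c, d]]."""
    return (a * d) / (b * c)


# Hypothetical layout:
#                     in enhancer   not in enhancer
# trait-associated        a=60           b=140
# background CpGs         c=1500         d=8500
p_enrichment = fisher_exact_two_sided(60, 140, 1500, 8500)
effect = odds_ratio(60, 140, 1500, 8500)
```

An odds ratio above 1 with a small P-value indicates enrichment of trait-associated CpGs in the annotation; an odds ratio below 1 indicates depletion.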

We further examined the subset of MCC-Seq TG-associated CpGs that overlapped nearby (250 bp flanking the CpG) CpGs from the Illumina 450K array used in an independent cohort of 650 female individuals from the MuTHER study66 with TG measurements and AT samples available. MuTHER is a population-based cohort study that includes female twins (1/3 dizygotic and 2/3 monozygotic) aged 38.7–84.6 years recruited from the TwinsUK resource94, which has previously been shown to be comparable to population singletons in terms of disease-related and lifestyle characteristics95. Methylation data on AT samples from the study members that were previously profiled on the Illumina 450K array were tested for association with TG levels using a linear mixed model similar to the above but also incorporating familial relationship, twin zygosity and other cofactors (Methods). We selected the top TG-associated MuTHER CpGs within 250 bp of the 2,580 tested CpGs identified at P≤0.001 (Supplementary Data 3), revealing 1,582 Illumina 450K array CpGs. Of these, 124 (8%) were found to be significantly associated with TG (Bonferroni P-value threshold of P=0.05; nominal P-value: 0.05/1,582=3.2 × 10⁻⁵) in the MuTHER data with the same direction of effect as observed in our study (8.9-fold enrichment; binomial test, P<2.2 × 10⁻¹⁶). Encouragingly, the replication was strengthened when limiting to Illumina 450K array CpGs directly overlapping the 2,580 TG-associated MCC-Seq CpGs with 18 out of 171 sites (11%) significantly associated in MuTHER with the same direction of effect as in MCC-Seq results (18.0-fold enrichment; binomial test, P<2.2 × 10⁻¹⁶).
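The replication arithmetic in this paragraph can be reproduced with a few lines. The counts (1,582 tested CpGs, 124 replicated; 171 directly overlapping, 18 replicated) come from the text; the function names and the sign-based check for direction of effect are illustrative assumptions.

```python
def bonferroni_threshold(alpha, n_tests):
    """Nominal P-value cutoff that controls the family-wise error at alpha."""
    return alpha / n_tests


def replicates(p_replication, effect_discovery, effect_replication, cutoff):
    """A CpG replicates if it passes the cutoff in the replication cohort
    and its effect has the same sign as in the discovery cohort."""
    same_direction = effect_discovery * effect_replication > 0
    return p_replication <= cutoff and same_direction


cutoff = bonferroni_threshold(0.05, 1582)   # about 3.2e-5
rate_nearby = 124 / 1582                    # CpGs within 250 bp
rate_direct = 18 / 171                      # directly overlapping CpGs
```

Requiring a concordant direction of effect in addition to the significance cutoff makes the replication criterion stricter than the P-value threshold alone.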

2.5.7 Assessment of Loci Harbouring TG-associated CpGs

We annotated in further detail the most significant TG-associated CpGs (P≤0.0001) mapping to an AT-specific regulatory element (N=89; Fig. 3; Supplementary Data 4). First, we assessed their association to TG in the larger MuTHER cohort as described above. In total, 33/89 TG-associated CpGs overlapped a nearby Illumina 450K-measured CpG (250 bp flanking the CpG) and were included in the analysis. Here, 21% were found to also be significantly associated with TG in the MuTHER cohort, with the same direction of effect using the stringent Bonferroni P-value threshold of P=0.05 as estimated above; nominal P-value=3.2 × 10⁻⁵. Furthermore, using the nominal P-value of 0.05 and the same direction of effect, as many as 16 CpGs (48%) showed evidence of association to TG in the independent cohort (Supplementary Table 5).

We recently showed a high degree of sequence dependency of AT DNA methylation and thus also examined the potential existence of genetic regulation among the TG-associated CpGs using our publicly available cis-mQTL data from AT profiled on the Illumina 450K array66. Again, we used the 33 CpGs that overlapped a nearby Illumina 450K CpG as described above. We found that methylation levels of 55% of these TG-associated CpGs are regulated by a nearby SNP at 1% false-discovery rate66, representing twofold enrichment (Fisher's exact test P=0.0017), indicating that a large proportion of trait-associated epigenetic variants identified here are under genetic control (Supplementary Table 5).

Next, we used transposase-accessible chromatin sequencing (ATAC-Seq), as detailed in the Methods, on adipocyte nuclei isolated from AT of an obese individual undergoing bariatric surgery, to further pinpoint and fine-map the effects of the TG-associated CpGs linked to hypomethylated footprints. ATAC-Seq96 is an antibody-independent method for profiling active regulatory regions by mapping open chromatin with sensitivity that is comparable to DNaseI-Seq but with the advantage of requiring only 100,000 input cells. Here we found that 65/89 CpGs (73%) were nearby (within 250 bp) or directly overlapping an ATAC-Seq peak, indicating that these CpGs could be mapped with high confidence to active regulatory regions in pure adipocytes.

We further examined the expression pattern of genes in the vicinity of our top CpGs in human adipocytes compared with various blood cell types (Methods). Candidate genes were identified as overlapping or within 100 kb of the TG-associated CpGs. We performed RNA sequencing (RNA-Seq) of adipocyte nuclei isolated from visceral and subcutaneous AT of four obese individuals undergoing bariatric surgery matching our discovery cohort, as well as from B cells, T cells and monocytes of four healthy donors (Methods). Differential expression analysis (Methods and Supplementary Data 5) revealed 38/76 CpGs (50%) being associated with genes significantly more expressed in adipocytes (when comparing both visceral- and subcutaneous-derived expression, log2 fold change>2, P<0.05), which is a significant enrichment of adipocyte-specific expression (1.6-fold, Fisher's exact test, P=6.6 × 10⁻⁴).

We also examined the overlap of our potential TG-associated loci with the National Human Genome Research Institute catalogue of results from GWAS (accessed January 2014) and found that genes linked to 19 (23%) of our CpGs were previously cited for a metabolic disease trait based on GWAS (1.5-fold enrichment; Fisher's exact test, P=0.06; Supplementary Data 5). These genes include CD36 (HDL-C), RPTOR (obesity) and ABCG5/ABCG8 (low-density lipoprotein cholesterol (LDL-C) and total cholesterol). Additional follow-up data on CD36 are provided below.

2.5.8 Follow-up of the TG-associated Loci Mapping to CD36

To illustrate these results, we selected the most significant CpG of the TG-associated loci for additional follow-up studies (chr7:80,276,086-80,276,087; GLM P=1.1 × 10⁻⁹; Fig. 4 and Supplementary Data 4). This CpG is located within an intragenic region of CD36, a gene encoding a glycoprotein with an important role in lipid metabolism97,98 that has been linked to metabolic disease susceptibility99. Levels of circulating CD36 protein were recently reported to be positively correlated to plasma TG levels in obese individuals100 and SNPs near the gene were associated to HDL-C levels in a large GWAS101. The TG-associated CpG maps to an LMR unique to AT.

Using RNA-Seq data generated from both human adipocytes derived from obese individuals and blood cells from healthy controls as described above (Supplementary Data 5), we found significantly higher CD36 expression in adipocytes than in blood cells (GLM, log2 fold change=2.4–11.0, P=6.3 × 10⁻²²–3.3 × 10⁻¹⁶¹). In an attempt to study whether the potential enhancer region where the TG-associated CpG maps controls expression of CD36, we used our publicly available array-based expression (IlluminaHT12) and methylation (Illumina 450K array) data from the MuTHER cohort (N≈650)66. We found that methylation of the closest Illumina 450K array CpG (cg05917188; Fig. 4) was negatively associated with expression of the main CD36 transcript in AT (linear mixed model, P=2.4 × 10⁻⁵), highlighting a gene regulatory effect of our TG-associated hypomethylated region. Finally, we also used the MuTHER cohort (N≈650) and cg05917188 as described above, for validation of the TG association, where we were able to verify the pronounced effect of methylation at the regulatory region on TG levels (linear mixed model, P=3.2 × 10⁻⁷; Fig. 4 and Supplementary Table 5). As recent GWAS efforts show links to HDL-C, we also tested for this association to CpG methylation within our discovery cohort and found a similar pattern (GLM, P=2.93 × 10⁻⁵) with concordant results from the MuTHER cohort (linear mixed model, P=1.8 × 10⁻³; Fig. 4).

Taken together with the other results described above, our data provide strong evidence in favour of an epigenetic effect of the AT-specific regulatory region of CD36 on multiple metabolic disease-related traits.


2.6 Discussion

The assessment of DNA methylation has emerged as an essential tool for understanding the aetiology of human disease102. Recent reports show that variable and functional epigenetic variants are enriched in enhancers, rather than in promoter and CpG island regions66, which are the principal regions assayed by commonly used targeted approaches (for example, Illumina 450K array and RRBS). Although WGBS is comprehensive, it is inefficient for the large-scale investigations that are required for methylation QTL studies and EWAS of common multifactorial diseases. This motivated us to look for an improved method for high-resolution interrogation of the variable functional component of the methylome.

We established MCC-Seq to assess target regions of the genome in a cost-effective and accurate manner. With MCC-Seq, we can examine active regulatory regions in disease-appropriate tissues, specifically permitting us to identify disease-linked DNA methylation variants that are not identifiable with previous targeting approaches.

MCC-Seq can include up to 200 Mb in custom, user-defined interrogation panels, which is an advantage over other available capture approaches. Samples can be multiplexed to obtain lower sequencing costs for large-scale EWAS. Although upfront analysis time is needed for proper selection of CpGs, the customizable and flexible design allows easy elimination of CpGs that are invariable across individuals65, providing further savings at the sequencing and computational levels. As an example, our Met V2 adipose-specific panel covers 4.5 × 10⁶ CpGs in regulatory regions and also includes the complete Illumina 450K panel of 480,000 CpGs, allowing comparisons or replication with studies that use the latter. We also demonstrate the capacity for multifunctional assays providing both comprehensive methylome and SNP genotype data, thereby permitting more data integration than other techniques such as Agilent SureSelect, where single-strand capture bias inhibits complete genotype profiling. As such, the Met V2 panel includes the complete set of SNPs from the Illumina HumanCore BeadChip, which covers highly informative genome-wide tag SNPs found across globally diverse populations, allowing for further high-density genotype imputation.

Comparisons of MCC-Seq to three alternative approaches (WGBS, Illumina 450K array and Agilent SureSelect) indicated that methylation calls derived from MCC-Seq correlated highly with all three methods (for example, R>0.96), with the Illumina 450K array showing slightly lower correlation. The lower correlation of MCC-Seq and WGBS with the Illumina 450K array data may be attributable to technical differences in DNA methylation assessment (that is, microarray versus next-generation sequencing) and probably higher specificity of methylation profiles called from sequencing methods at sufficient depth. Overall, we believe that MCC-Seq, with its larger flexible platform, genotyping ability and low DNA input requirements, is more adapted to large-scale EWAS than other studied approaches.

Based on our results, we predict that MCC-Seq will be particularly valuable for the identification of functional, disease-linked DNA methylation variants. In fact, we demonstrated the potential of such an approach by applying an AT-specific panel to a cohort of 72 individuals in which we had measured metabolic-related traits, including TG. In agreement with the current literature, the analysis of the TG-associated CpGs revealed a clear significant enrichment within putative enhancer regions as defined by hypomethylated regions and adipocyte-specific ChIP-Seq data (NIH Roadmap Epigenomics Mapping Consortium) and a clear underrepresentation within putative tissue-independent promoter regions. This demonstrates the importance of investigating putative enhancer regions for functional epigenetic variant identification, as these regions are currently underrepresented in the commonly used Illumina 450K array and RRBS approaches. When comparing the subset of results from MCC-Seq that were available from the Illumina 450K array data on the large MuTHER/TwinsUK cohort of 650 female twins66, we found a significant overlap of CpGs exhibiting significant TG association in the two data sets, further validating MCC-Seq as a tool for powerful discovery of trait-associated methylation variation.

Additional investigations were performed for the most significant TG-associated CpG mapping to a regulatory region within CD36, which is known to function in fatty acid and glucose metabolism97,98. Here we were not only able to validate the results in the MuTHER resource showing consistent direction of effect of the association of methylation variation in the regulatory region with TG but also show evidence of regulation of CD36 expression by our identified AT-specific regulatory region at the population level. These results of a potential AT regulation of CD36 expression were further strengthened by RNA-Seq data of purified adipocytes and multiple blood cell types showing pronounced difference in the expression pattern with adipocytes expressing CD36 at considerably higher level. As our discovery cohort included obese individuals diagnosed with or without metabolic syndrome, we also tested another trait used for the diagnosis, HDL-C, in association with DNA methylation at our CpG of interest. Interestingly, we noted a similar association to HDL-C, which was further validated in the MuTHER cohort, indicating that epigenetic variants of CD36 may be able to serve as a biomarker for cardiovascular disease prediction in obese individuals similar to what has been suggested for circulating plasma CD36 for type 2 diabetes prediction103.

In conclusion, MCC-Seq provides high-resolution and cost-effective interrogation of functional methylomes in disease-relevant tissues with concurrent genotyping of potentially millions of SNPs. With its customizable panel design, our approach permits flexibility in both the size and the choice of regions to be interrogated for disease-associated epigenetic variant discovery. Our results demonstrate the significant utility of the approach over WGBS, Illumina 450K array and Agilent SureSelect methods. We demonstrate that targeting active regulatory regions for disease-associated DNA methylation CpG investigation is a valid alternative to whole-genome investigation. Our data suggest that applying MCC-Seq in large cohorts will be a powerful approach to identify trait-associated methylation in studies of human disease.


2.7 Online Methods

2.7.1 First-generation Panel Design

We designed a first-generation capture panel (Met V1) targeting the functional methylome in human AT. Regions incorporated in the panel design included hypomethylated windows generated from merged WGBS data from 30 subcutaneous AT samples derived from the MuTHER/TwinsUK cohort66. Briefly, the mean methylation levels per CpG were kept for those detected in at least 3 individuals resulting in 15,462,376 CpGs. We then calculated the probability of obtaining the specific methylation level (excluding complete hypomethylation corresponding to 0%) per CpG in our merged data set. The probabilities were then merged for three, four, five and ten consecutive CpGs within a window of 1 kb. As the majority of CpGs are hypermethylated with a mean methylation of 80%, hypomethylated windows corresponded to small probability estimates. Based on these probabilities, we then selected the bottom 2% of the different windows generated for inclusion in the Met V1 panel design.

Next, AT-specific regulatory elements were incorporated into the panel design. Regulatory elements (H3K4me1 and H3K4me3) from AT nuclei derived from five independent donors were downloaded from the NIH Roadmap Epigenomics Project as follows66,104. Aligned ChIP-Seq reads (BAM files) of the H3K4me1 and H3K4me3 marks, as well as the ChIP-Seq input, were downloaded from the NIH Roadmap Epigenomics Project (GEO repository accessions GSM621425, GSM669908, GSM669975, GSM670045, GSM772757, GSM621435, GSM669925, GSM669988, GSM669998, GSM670041, GSM621401, GSM669934, GSM669940, GSM669984 and GSM670043). Each file of the H3K4me1 and H3K4me3 marks was segmented into 100 bp bins. Within each bin, the sequence reads were counted. The bin counts were divided by the total number of sequence reads to obtain normalized intensity signals. ChIP-Seq input reads were processed in the same way and their normalized signal intensity values were subsequently subtracted from the normalized bin intensity signals. The H3K4me1 and H3K4me3 bins were then ranked according to these values. Based on the mean ranking across the five individuals, the top 1% bins per histone mark were then included in the panel design.
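The binning, normalization, input subtraction and mean-rank selection described above can be sketched as follows. The sketch works on read start positions for a single chromosome and uses invented toy data; the function names, the handling of empty bins and the tie-breaking order are assumptions, not details taken from the study.

```python
def normalized_bins(read_starts, bin_size=100):
    """Count reads per fixed-size bin and divide by the total read count."""
    total = len(read_starts)
    counts = {}
    for position in read_starts:
        bin_index = position // bin_size
        counts[bin_index] = counts.get(bin_index, 0) + 1
    return {b: c / total for b, c in counts.items()}


def input_subtracted(chip, control):
    """Subtract the normalized input signal from the ChIP signal per bin."""
    bins = set(chip) | set(control)
    return {b: chip.get(b, 0.0) - control.get(b, 0.0) for b in bins}


def top_bins_by_mean_rank(signals_per_donor, fraction=0.01):
    """Rank bins within each donor (rank 1 = strongest signal), average
    the ranks across donors and return the top fraction of bins."""
    bins = sorted(set().union(*signals_per_donor))
    rank_sum = {b: 0 for b in bins}
    for signal in signals_per_donor:
        ordered = sorted(bins, key=lambda b: signal.get(b, 0.0), reverse=True)
        for rank, b in enumerate(ordered, start=1):
            rank_sum[b] += rank
    n_keep = max(1, int(len(bins) * fraction))
    n_donors = len(signals_per_donor)
    return sorted(bins, key=lambda b: rank_sum[b] / n_donors)[:n_keep]


# Toy read start positions for one donor (ChIP and matched input).
chip_reads = [10, 20, 150, 160, 170, 250]
input_reads = [5, 120, 230, 260]

signal = input_subtracted(normalized_bins(chip_reads),
                          normalized_bins(input_reads))
selected = top_bins_by_mean_rank([signal, signal])
```

Averaging ranks rather than raw signals makes the selection less sensitive to differences in sequencing depth and signal scale between donors.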

Finally, 53,638 Illumina 450K array probes with CpGs showing association (per-trait

Bonferroni P<0.05; nominal P<1.4.0 × 10−7) to metabolic phenotypes (for example,

BMI, total cholesterol, HDL-C, LDL-C and total TGs) were selected for inclusion in the Met V1 panel design (Supplementary Data 1). These associations were identified through an analysis of Illumina 450K array AT methylation data collected from 648 female twins from the MuTHER/TwinsUK resource66.

In total, 79.6 Mb of sequence was targeted. Roche NimbleGen R&D was responsible for probe design. Each targeted region was extended to a minimum size of 100 bp and the capture probes were extended beyond the edge of each target to assure coverage yielding a total of 87.3 Mb of sequence in the final panel, which covered 99.2% of our input sequence (Supplementary Data 6). Only 978 of our selected targets failed in the custom probe design. In total, the Met V1 panel targeted 2,496,975 CpGs of which 210,883 overlapped with Illumina 450K array sites.

2.7.2 Generation of Second-generation Panel

The second-generation panel for adipose methylome capture (Met V2) was designed to cover 131 Mb including extension to 100 bp and additional flanking regions. We identified and incorporated into the panel design AT hypomethylated regions as described under ‘Identification of hypomethylated regions’. Inclusion was limited to UMRs below a size of 7,000 bp and LMRs above 100 bp (excluding two large outliers; Supplementary Data 2). Selected hypomethylated regions covered 2,213,942 and 469,962 CpGs for UMRs and LMRs, respectively. As in Met V1, AT regulatory regions were also incorporated into the panel design, selecting the 677,809 and 1,327,121 CpGs from the top 1% bins of regulatory elements (H3K4me1 and H3K4me3) characterized in human adipocytes by the NIH Roadmap Epigenomics Consortium as described above. Furthermore, we included all 482,421 CpGs on the Illumina 450K array and all 256,327 SNPs from the Illumina HumanCore BeadChip. Finally, we selected 28,947 metabolic disease GWAS SNPs from the GWAS catalogue for inclusion into the panel design.

We merged all selected regions using the R/Bioconductor package GenomicRanges. Roche NimbleGen generated a 156.2-Mb panel based on our regions, covering 97.9% of our original targeted sequences in 629,845 regions (Supplementary Data 7). A summary of the generated panel indicated that 16,759 of our original targets were unsuccessfully covered by the custom probes. We determined that the maximum CpG coverage capacity of the Met V2 panel is 4,442,383 CpGs.

2.7.3 MCC-Seq Protocol

The MCC-Seq protocol was developed and optimized in collaboration with Roche NimbleGen R&D. Briefly, in MCC-Seq a whole-genome sequencing library is prepared and bisulfite converted, amplified and a capture enriching for targeted bisulfite-converted DNA fragments is carried out. The captured DNA is further amplified and sequenced. More specifically, whole-genome sequencing libraries were generated from 700 to 1,000 ng of genomic DNA spiked with 0.1% (w/w) unmethylated λ DNA (Promega) previously fragmented to 300–400 bp peak sizes using the Covaris focused-ultrasonicator E210. Fragment size was controlled on a Bioanalyzer DNA 1000 Chip (Agilent) and the KAPA High Throughput Library Preparation Kit (KAPA Biosystems) was applied. End repair of the generated dsDNA with 3′- or 5′-overhangs, adenylation of 3′-ends, adaptor ligation and clean-up steps were carried out as per KAPA Biosystems' recommendations. The cleaned-up ligation product was then analysed on a Bioanalyzer High Sensitivity DNA Chip (Agilent) and quantified by PicoGreen (Life Technologies). Samples were then bisulfite converted using the Epitect Fast DNA Bisulfite Kit (Qiagen), according to the manufacturer's protocol. Bisulfite-converted DNA was quantified using OliGreen (Life Technologies) and, based on quantity, amplified by 9–12 cycles of PCR using the Kapa Hifi Uracil+ DNA polymerase (KAPA Biosystems), according to the manufacturer's protocol. The amplified libraries were purified using Ampure Beads and validated on Bioanalyzer High Sensitivity DNA Chips, and quantified by PicoGreen.

Next, the SeqCap Epi Enrichment System protocol (Roche NimbleGen) was carried out for the capture. The hybridization procedure of the amplified bisulfite-converted library was performed as described by the manufacturer, using 1 µg of total input of library, which was evenly divided by the libraries to be multiplexed, and incubated at 47 °C for 72 h. Washing and recovering of the captured library, as well as PCR amplification and final purification, were carried out as recommended by the manufacturer. Quality, concentration and size distribution of the captured library were determined by Bioanalyzer High Sensitivity DNA Chips. Each capture was sequenced on the Illumina HiSeq2000/2500 system using 100 bp paired-end sequencing.

2.7.4 MCC-Seq Methylation Profiling

Reads were aligned to the bisulfite-converted reference genome using BWA (v.0.6.1)92. We removed the following: (i) clonal reads, (ii) reads with low mapping quality score (<20), (iii) reads with >2% mismatch to converted reference over the alignment length, (iv) reads mapping on the forward and reverse strand of the bisulfite-converted genome, (v) read pairs not mapped at the expected distance based on library insert size and (vi) read pairs that mapped in the wrong direction as described by Johnson et al.105. To avoid potential biases in downstream analyses, CpGs were further filtered as follows: CpGs not covered by at least five reads, CpGs not covered by at least two reads per strand, CpGs overlapping an SNP (dbSNP 137) and sites overlapping DAC Blacklisted Regions or Duke Excluded Regions generated by the ENCODE project (http://hgwdev.cse.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability). We further selected CpG sites that exhibited ≤20% methylation difference between strands. Finally, all off-target reads were removed. Methylation values at each site were calculated as total (forward and reverse) non-converted C-reads over total (forward and reverse) reads. CpGs were included in subsequent analysis if the number of sequence reads was five or greater. In some analyses, we also excluded sites at which the average sequence depth over all study individuals was below the 20th percentile in the complete data set. CpGs were counted once per location combining both strands together.
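The per-CpG depth and strand filters and the methylation value defined above can be expressed as a short function. This is a minimal sketch that takes already-tallied read counts per strand as input; the function name and argument layout are hypothetical, and the read-level, SNP, blacklist and off-target filters are not modelled.

```python
def call_methylation(fwd_c, fwd_total, rev_c, rev_total,
                     min_depth=5, min_per_strand=2, max_strand_diff=0.20):
    """Methylation fraction at one CpG, or None if the site is filtered.

    fwd_c / rev_c: non-converted C reads on the forward / reverse strand.
    fwd_total / rev_total: all reads covering the site on each strand.
    """
    total = fwd_total + rev_total
    if total < min_depth:
        return None                      # fewer than five reads overall
    if fwd_total < min_per_strand or rev_total < min_per_strand:
        return None                      # fewer than two reads on a strand
    strand_diff = abs(fwd_c / fwd_total - rev_c / rev_total)
    if strand_diff > max_strand_diff:
        return None                      # strands disagree by more than 20%
    return (fwd_c + rev_c) / total       # both strands combined


kept = call_methylation(8, 10, 7, 10)          # passes all filters
low_depth = call_methylation(1, 2, 1, 2)       # only four reads in total
one_sided = call_methylation(3, 3, 1, 1)       # one read on reverse strand
discordant = call_methylation(9, 10, 2, 10)    # 90% versus 20% by strand
```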

2.7.5 Illumina 450K Array Methylation Profiling

Bisulfite conversion was conducted on 1 µg of a subset of 24 VAT DNA samples and quantitative DNA methylation analysis was carried out at the McGill University and Génome Québec Innovation Centre (Montreal, Canada). Infinium HumanMethylation450 BeadChip (Illumina) was processed according to the manufacturer's instructions. Methylation data were visualized and analysed using the GenomeStudio software version 2011.1 (Illumina) and the Methylation Module. None of the samples were excluded following quality control steps assessed by bisulfite conversion, extension, staining, hybridization, target removal, negative and nonpolymorphic control probes. Methylation levels (β-values) were estimated as the ratio of signal intensity of the methylated alleles to the sum of methylated and unmethylated intensity signals of the alleles (β-value=C/(T+C)). The β-values vary from 0 (no methylation) to 1 (100% methylation). Methylation β-values were further quantile normalized to remove unwanted technical variation, using control probes as recently presented93.

2.7.6 Agilent SureSelect CpG Profiling and MCC-Seq Comparisons

An MCC-Seq panel (Roche NimbleGen) was designed to mimic the SureSelect Human Methyl-Seq panel (Agilent) by designing probes against the same genomic coordinates, but targeting both DNA strands. As the MCC-Seq protocol hybridizes probes to library fragments after bisulfite treatment and PCR amplification, when the sequences of those fragments may be highly variable depending on the CpG density and initial methylation status of each CpG within each original DNA molecule, multiple probes with different sequences were designed to permit effective hybridization capacity over the full range of possible post-bisulfite sequences. The MCC-Seq and SureSelect Methyl-Seq capture experiments were executed at Roche NimbleGen (Madison, WI), while the SureSelect Methyl-Seq captures and sequencing were performed by a third-party service provider, according to manufacturer's instructions, using 1 µg (MCC-Seq) or 3 µg (SureSelect) of DNA extracted from the LCL GM12878 cell line.

MCC-Seq reads were filtered according to our bioinformatics pipeline described above (MCC-Seq methylation profiling). Given the single-strand bias of the Agilent method, no filters were applied on the Agilent SureSelect data. Comparisons of the methylation calls from both methods were made for overlapping sites at ≥5X (N=2,551,186 CpGs) and at ≥10X (N=2,496,975 CpGs).

2.7.7 Trait-association Discovery Cohort

Between June 2000 and July 2012, 1,906 severely obese men (N=597) and women (N=1,309) undergoing biliopancreatic diversion with duodenal switch106 at the Quebec Heart and Lung Institute (Quebec City, Quebec, Canada) were recruited. Subjects had fasted overnight before the surgical procedure. Anaesthesia was induced by a short-acting barbiturate and maintained by fentanyl and a mixture of oxygen and nitrous oxide. VAT samples were obtained within 30 min of the beginning of the surgery from the greater omentum107. Here, a subset of the VAT cohort was included corresponding to 72 individuals (BMI >40 kg m⁻²; discovery cohort) free of metabolic diseases such as type 2 diabetes, cardiomyopathy, or endocrine disorders. Thirty-five individuals were deemed to have metabolic syndrome (MetS+ group), while the remaining 37 were not affected (MetS− group). The presence of MetS was determined by the National Cholesterol Education Program Adult Treatment Panel III guidelines when an individual fulfilled three or more criteria11. None of the study participants was on medication to treat MetS features. The sample collection of AT was approved by the Université Laval and McGill University (IRB FWA00004545) ethics committees and performed in accordance with the principles of the Declaration of Helsinki. Tissue banking and the severely obese cohort were approved by the research ethics committees of the Quebec Heart and Lung Institute. All participants provided written informed consent before enrolment in the study.


Body weight, height, waist girth and resting systolic and diastolic blood pressure were measured preoperatively by standardized procedures. BMI was calculated as weight in kilograms divided by height in metres squared. Plasma total cholesterol (total-C), TG and HDL-C levels were measured using enzymatic assays. HDL-C was measured in the supernatant following precipitation of very-low-density lipoproteins and low-density lipoproteins with dextran sulphate and magnesium chloride. Plasma LDL-C levels were estimated with the Friedewald formula. Fasting glucose concentrations were enzymatically measured108.
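The two derived quantities in this paragraph are simple formulas. The text does not spell out the Friedewald formula or the units used; the sketch below uses its standard form, LDL-C = total-C − HDL-C − TG/5 in mg/dL (TG/2.2 in mmol/L), and the usual restriction to TG below 400 mg/dL (about 4.5 mmol/L). The function names, the default unit and the example values are assumptions.

```python
def bmi(weight_kg, height_m):
    """Body mass index: weight in kilograms over height in metres squared."""
    return weight_kg / height_m ** 2


def friedewald_ldl(total_c, hdl_c, tg, units="mmol/L"):
    """Estimate LDL-C with the standard Friedewald formula.

    VLDL-C is approximated as TG/5 (mg/dL) or TG/2.2 (mmol/L). The
    estimate is conventionally not used when TG is 400 mg/dL
    (about 4.5 mmol/L) or higher.
    """
    if units == "mg/dL":
        divisor, tg_limit = 5.0, 400.0
    elif units == "mmol/L":
        divisor, tg_limit = 2.2, 4.5
    else:
        raise ValueError("units must be 'mg/dL' or 'mmol/L'")
    if tg >= tg_limit:
        raise ValueError("Friedewald estimate is unreliable at this TG level")
    return total_c - hdl_c - tg / divisor


example_bmi = bmi(120.0, 1.70)                            # severely obese range
example_ldl = friedewald_ldl(200.0, 50.0, 150.0, "mg/dL")
```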

2.7.8 DNA Isolation

Genomic DNA was extracted from 200 mg of all 72 VAT samples using the DNeasy Blood & Tissue kit (Qiagen), as recommended by the manufacturer, and quantified using both NanoDrop Spectrophotometer (Thermo Scientific) and PicoGreen DNA methods.

2.7.9 Identification of Hypomethylated Regions

We merged WGBS data from 30 healthy individuals, filtered as described under ‘MCC-Seq methylation profiling’, to define AT-specific hypomethylated regulatory regions. A minimum of three individuals per CpG was set as a threshold for inclusion into the merged set. We applied the R/Bioconductor package MethylSeekR to the data set to identify and define regulatory regions as LMRs and UMRs. Briefly, this package uses a cutoff method wherein UMRs and LMRs are predicted at single-base resolution as regions of consecutive CpGs having methylation statuses under a set level, with UMRs being differentiated from LMRs based on a minimum content of 30 CpGs61. By default, a methylation threshold of 50% and a false-discovery rate of 5% were set for the analysis61, fixing consecutive CpGs at ≥4. We identified 20,195 UMRs and 45,065 LMRs for AT. The same procedure was carried out in WGBS data collected from whole blood samples of the same cohort, identifying 19,871 UMRs and 46,159 LMRs. We intersected the AT and whole blood hypomethylated regions and found 2,342 and 24,687 AT-unique UMRs and LMRs, respectively.
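The cutoff logic described above can be illustrated with a short Python sketch. It is a simplification and not the MethylSeekR implementation: it works on an ordered list of per-CpG methylation fractions, omits any smoothing or false-discovery-rate calibration that the package performs, and the function name and return format are hypothetical.

```python
from typing import List, Tuple


def call_hypomethylated_segments(
    meth: List[float],
    max_meth: float = 0.5,    # methylation threshold (50%)
    min_cpgs: int = 4,        # minimum number of consecutive CpGs
    umr_min_cpgs: int = 30,   # segments with >= 30 CpGs are labelled UMR
) -> List[Tuple[int, int, str]]:
    """Return (start, end_exclusive, label) for runs of CpGs below max_meth."""
    segments = []
    start = None
    # A sentinel value above any threshold closes a run that reaches the end.
    for i, m in enumerate(list(meth) + [float("inf")]):
        if m < max_meth:
            if start is None:
                start = i
        elif start is not None:
            n_cpgs = i - start
            if n_cpgs >= min_cpgs:
                label = "UMR" if n_cpgs >= umr_min_cpgs else "LMR"
                segments.append((start, i, label))
            start = None
    return segments
```

Runs shorter than `min_cpgs` are discarded, so isolated low-methylation CpGs do not produce a segment.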

2.7.10 Genotyping

The same samples (N=24) included for Illumina 450K methylation profiling were also selected for high-density genotyping using the Illumina HumanOmni2.5-8 (Omni2.5) BeadChip according to protocols recommended by Illumina. After applying quality control filters, genotypes were retained for 2,132,665 SNP sites. Genotype calls were inferred simultaneously from MCC-Seq data (Met V1) using the Bis-SNP109 software, a bisulfite-sequencing variant caller, with default parameters (‘-T BisulfiteGenotyper -stand_call_conf 20 -stand_emit_conf 0 -mmq 30 -mbq 17 -minConv 0’) and with dbSNP 137 as prior SNP information. The aligned BAM files were used as input and hg19 was used as the reference genome. These genotypes were then compared with the genotypes from the Omni2.5 genotyping data.
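The comparison between sequencing-derived and array genotypes can be sketched as a simple concordance calculation over shared sites. This is an illustrative assumption about how such a comparison might be coded, not the procedure used in the study; the data layout (dictionaries keyed by chromosome and position, with unphased two-letter genotypes) and the function names are hypothetical.

```python
from typing import Dict, Optional, Tuple

Site = Tuple[str, int]  # (chromosome, position)


def _normalise(genotype: str) -> str:
    """Make unphased genotypes comparable, e.g. 'GA' and 'AG' both become 'AG'."""
    return "".join(sorted(genotype.upper()))


def genotype_concordance(
    seq_calls: Dict[Site, str], array_calls: Dict[Site, str]
) -> Tuple[int, Optional[float]]:
    """Return (number of shared sites, fraction with identical genotypes)."""
    shared = seq_calls.keys() & array_calls.keys()
    if not shared:
        return 0, None
    matches = sum(
        _normalise(seq_calls[s]) == _normalise(array_calls[s]) for s in shared
    )
    return len(shared), matches / len(shared)
```

Only sites called on both platforms contribute, so the result reflects agreement where both methods have data rather than overall call rate.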


2.7.11 Epigenome-wide Association of TG Levels

Associations of methylation levels of CpGs detected in VAT (N=72) with TG levels were tested using a GLM function implemented in R v3.1.1. Two outliers in TG levels were identified by setting a cutoff of mean±3 s.d. and removed from any further analysis. The response variable (methylation levels) was fitted to a binomial distribution weighted for sequence read coverage at each site and adjusted for age, sex and BMI. All CpGs associated with TG at P<0.05 were subjected to permutation tests to establish the significance of the phenotype effect, as follows: the DNA methylation values for each CpG were permuted 10,000 times and the GLM was fitted at each permutation round. Permutation P-values were established by counting how many times the permuted association resulted in a P-value smaller than the observed GLM P-value for each CpG. Replication of the 2,580 MCC-Seq TG-associated CpGs with permutation P≤0.001 was conducted in AT methylation data from an independent cohort of 648 female individuals in the MuTHER cohort. Associations between Illumina 450K array methylation data and TG levels were assessed using a linear mixed model taking into account familial relationship, twin zygosity and other cofactors (that is, age, beadchip, bisulfite conversion efficiency and bisulfite-treated DNA input), and summary statistics were obtained from http://www.sanger.ac.uk/resources/software/genevar/. Expanding to 250 bp flanking regions around MCC-Seq TG-associated sites, we were able to assess replication status in 1,582 sites. Expression QTL data available from this same cohort were further used to validate the TG association established within the MCC-Seq data at the CD36 locus (http://www.sanger.ac.uk/resources/software/genevar/).
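The association and permutation steps for a single CpG can be sketched in Python as follows. This is a simplified stand-in for the R analysis described above, under stated assumptions: methylation is modelled as methylated reads out of total reads (equivalent to a coverage-weighted binomial fit), TG is the only predictor (the age, sex and BMI covariates are omitted), the P-value is a Wald test, and a permuted P-value equal to the observed one is counted. All function names are hypothetical.

```python
import math
import random
from typing import Sequence, Tuple


def fit_binomial_glm(
    x: Sequence[float],
    meth_reads: Sequence[int],
    total_reads: Sequence[int],
    n_iter: int = 50,
) -> Tuple[float, float]:
    """Fit logit(p_i) = b0 + b1 * x_i by Newton-Raphson.

    Returns the slope b1 and its two-sided Wald P-value.
    """
    b0 = b1 = 0.0
    for _ in range(n_iter):
        u0 = u1 = i00 = i01 = i11 = 0.0
        for xi, yi, ni in zip(x, meth_reads, total_reads):
            p = 1.0 / (1.0 + math.exp(-(b0 + b1 * xi)))
            w = ni * p * (1.0 - p)   # binomial variance, scaled by coverage
            r = yi - ni * p          # observed minus expected methylated reads
            u0 += r
            u1 += r * xi
            i00 += w
            i01 += w * xi
            i11 += w * xi * xi
        det = i00 * i11 - i01 * i01
        if det <= 0.0:
            raise ValueError("predictor has no variation; model is not identifiable")
        d0 = (i11 * u0 - i01 * u1) / det
        d1 = (i00 * u1 - i01 * u0) / det
        b0 += d0
        b1 += d1
        if abs(d0) + abs(d1) < 1e-10:
            break
    z = b1 / math.sqrt(i00 / det)    # Wald statistic for the slope
    return b1, math.erfc(abs(z) / math.sqrt(2.0))


def permutation_p(
    x: Sequence[float],
    meth_reads: Sequence[int],
    total_reads: Sequence[int],
    n_perm: int = 10000,
    seed: int = 0,
) -> float:
    """Fraction of permutations whose GLM P-value is <= the observed one."""
    rng = random.Random(seed)
    _, p_obs = fit_binomial_glm(x, meth_reads, total_reads)
    pairs = list(zip(meth_reads, total_reads))
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(pairs)           # permute methylation values across samples
        y_perm, cov_perm = zip(*pairs)
        _, p_perm = fit_binomial_glm(x, y_perm, cov_perm)
        if p_perm <= p_obs:
            hits += 1
    return hits / n_perm
```

Permuting the (methylated, total) read pairs together keeps each sample's coverage attached to its methylation value, so only the link to the phenotype is broken.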

2.7.12 Adipocyte Nuclei Isolation

Subcutaneous and visceral adipose tissues were collected from obese individuals undergoing biliopancreatic diversion with duodenal switch. Mature adipocytes were isolated as follows110: freshly sampled adipose tissues were minced and digested in Krebs Ringer Henseleit Buffer (1 M HEPES, 2 M NaCl, 1 M KCl, 1 M CaCl2, 1 M MgCl2, 1 M K2HPO4, pH 7.4) supplemented with 5 mM glucose, 0.1 µM adenosine, 0.1 mg ml−1 ascorbic acid, 4% electrophoresis grade, delipidated BSA and 350 U ml−1 collagenase (Worthington Biochemical Corp., Lakewood, NJ) for 45 min with agitation (37 °C). Adipocyte suspensions were filtered through nylon mesh and washed with the buffer. Isolated adipocytes were homogenized in two volumes of lysis buffer (25 mM Tris pH 7.5, 5 mM MgCl2, 0.5% Triton X-100, 0.3 M sucrose and protease inhibitors) for 2 min on ice, then centrifuged at 3,220g for 25 min (4 °C). The pellets were washed twice with lysis buffer and resuspended in nuclei storage buffer (50 mM Tris pH 7.8, 5 mM MgCl2, 0.1 mM EDTA, 0.1 mM dithiothreitol, 40% v/v glycerol) for freezing.

2.7.13 Transposase-accessible chromatin Sequencing

ATAC-Seq libraries were generated on 100,000 mature adipocyte nuclei using a modified version of a recently published protocol96. More precisely, the transposase reaction was carried out for 30 min at 37 °C in a 25-µl reaction volume using 10X transposase concentration (Illumina Nextera Kit). EDTA (25 mM) was added to the reaction mix, which was transferred to ice before recovering DNA using MinElute PCR Purification columns (Qiagen). Next, samples were PCR enriched (ten cycles; Supplementary Table 6) and DNA was isolated using GeneRead Purification columns (Qiagen). Libraries were quantified by quantitative PCR (Supplementary Table 7), Picogreen and LabChip, and were then sequenced on the Illumina HiSeq2500 (paired-end 100 bp) using the Nextera sequencing primers.

Raw reads were trimmed for quality (phred33 ≥30) and length (n≥32), and Illumina adapters were clipped off using Trimmomatic v. 0.22111. Filtered reads were aligned to the hg19 human reference using BWA v.0.6.192. Peaks were called without a control using MACS v. 2.0.10.07132012112 at a q-value cutoff of 0.05.
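The quality and length criteria above can be illustrated with a small Python sketch. It is not Trimmomatic's algorithm, and the exact trimming steps used are not given here: the sketch assumes that bases below the quality threshold are removed from the 3' end only and that reads shorter than the minimum length after trimming are discarded. The function name is hypothetical.

```python
from typing import Optional, Tuple


def trim_read(
    seq: str,
    qual: str,
    min_q: int = 30,     # minimum base quality (Phred scale)
    min_len: int = 32,   # minimum read length after trimming
    offset: int = 33,    # phred33 ASCII encoding
) -> Optional[Tuple[str, str]]:
    """Trim low-quality bases from the 3' end; return None if the read is too short."""
    end = len(seq)
    while end > 0 and ord(qual[end - 1]) - offset < min_q:
        end -= 1
    if end < min_len:
        return None
    return seq[:end], qual[:end]
```

In the phred33 encoding, the character `?` corresponds to a quality of 30, so any trailing base encoded below `?` is removed by this sketch.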

2.7.14 Blood Cell Isolation

Peripheral blood mononuclear cells were purified from buffy coats originating from 450 ml blood of healthy blood donors (Uppsala Blood Transfusion Center, Uppsala University Hospital, Sweden), using Ficoll-Paque (GE Healthcare) density-gradient centrifugation. B cells, T cells and monocytes were isolated from dedicated batches of peripheral blood mononuclear cells, using positive selection with CD19+, CD3+ and CD14+ beads (Miltenyi Biotec), respectively, according to the manufacturer's instructions.


2.7.15 RNA Sequencing

RNA isolations were performed using the miRNeasy Mini Kit (Qiagen). RNA library preparations were carried out on 500 ng of RNA with RNA integrity number (RIN)>7 isolated from adipocyte cells extracted from AT113,114 and blood cells (CD19+, CD3+ and CD14+) using the Illumina TruSeq Stranded Total RNA Sample preparation kit, according to the manufacturer's protocol. Final libraries were analysed on a Bioanalyzer and sequenced on the Illumina HiSeq2000 (paired-end 100 bp sequences). Raw reads were trimmed for quality (phred33≥30) and length (n≥32), and Illumina adapters were clipped off using Trimmomatic v. 0.32111. Filtered reads were aligned to the hg19 human reference using Tophat v.2.0.10115 and bowtie v.2.1.0116. Raw read counts of UCSC genes were obtained using htseq-count v.0.6.1 (http://www-huber.embl.de/users/anders/HTSeq). Differential expression analysis was done using DESeq41, including adipocytes isolated from AT (subcutaneous and visceral) of four obese individuals undergoing bariatric surgery and different blood cell types (B cells, T cells and monocytes) of four healthy individuals (Uppsala Blood Transfusion Center, Uppsala University Hospital, Sweden).


2.8 Acknowledgements

We thank J. Wendt (Roche NimbleGen) for useful input on the experiments. We also thank W. Cheung, M. Turgeon and C. Greenwood at McGill University for bioinformatics and statistical support. This work was supported by a Canadian Institutes of Health Research (CIHR) team grant awarded to E.G., A.T., M.C.V. and M.L. (TEC-128093) and the CIHR-funded Epigenome Mapping Centre at McGill University (EP1-120608) awarded to T.P. and M.L., and the Swedish Research Council, Knut and Alice Wallenberg Foundation and the Torsten Söderberg Foundation awarded to L.R. F.A. holds a studentship from The Research Institute of the McGill University Health Center (MUHC). F.G. is a recipient of a research fellowship award from the Heart and Stroke Foundation of Canada. A.T. is the director of a Research Chair in Bariatric and Metabolic Surgery. M.C.V. is the recipient of the Canada Research Chair in Genomics Applied to Nutrition and Health (Tier 1). E.G. and T.P. are recipients of a Canada Research Chair Tier 2 award. The MuTHER Study was funded by a programme grant from the Wellcome Trust (081917/Z/07/Z) and core funding for the Wellcome Trust Centre for Human Genetics (090532). TwinsUK was funded by the Wellcome Trust and the European Community's Seventh Framework Programme (FP7/2007-2013). The study also receives support from the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy's and St Thomas' NHS Foundation Trust in partnership with King's College London. T.D.S. is a holder of an ERC Advanced Principal Investigator award. SNP genotyping was performed by The Wellcome Trust Sanger Institute and National Eye Institute via NIH/CIDR. Finally, we thank the NIH Roadmap Epigenomics Consortium and the Mapping Centers (http://nihroadmap.nih.gov/epigenomics/) for the production of publicly available reference epigenomes. Specifically, we thank the mapping centre at MGH/BROAD for generation of human adipose reference epigenomes used in this study.


2.9 Additional Information

Accession codes

The methylation 450K data have been deposited in the Gene Expression Omnibus (GEO), http://www.ncbi.nlm.nih.gov/geo (accession no. GSE59524). All MCC-Seq data from the discovery cohort as well as adipocyte ATAC-Seq and RNA-Seq data can be visualized in the UCSC Genome Browser, http://genome.ucsc.edu, using the Track Hub Data feature (‘McGill Adipose Tissue Epigenome’) by adding the following URL to ‘My Hubs’: http://hubs.hpc.mcgill.ca/Belin/Adipose_MCCSeq_Hub.txt. All processed MCC-Seq data from the discovery cohort and from the adipocyte RNA-Seq analyses are available in the ArrayExpress database (www.ebi.ac.uk/arrayexpress; accession nos. E-MTAB-3181 and E-MTAB-3182). Raw reads from RNA-Seq, ATAC-Seq and MCC-Seq are deposited in the European Genome-phenome Archive (EGA) and available after approval by the Data Access Committee (DAC) designated to the study (https://www.ebi.ac.uk/ega/home).

Competing financial interests

The authors declare no competing financial interests.


The Multiple Tissue Human Expression Resource Consortium

Kourosh R. Ahmadi14, Chrysanthi Ainali15, Amy Barrett9, Veronique Bataille14, Jordana T. Bell14, Alfonso Buil16, Emmanouil T. Dermitzakis16, Antigone S. Dimas8,16, Richard Durbin11, Daniel Glass14, Neelam Hassanali9, Catherine Ingle11, David Knowles17, Maria Krestyaninova18, Cecilia M. Lindgren8, Christopher E. Lowe19,20, Eshwar Meduri11,14, Paola di Meglio22, Josine L. Min8, Stephen B. Montgomery16, Frank O. Nestle22, Alexandra C. Nica16, James Nisbet11, Stephen O’Rahilly19,20, Leopold Parts11, Simon Potter11, Johanna Sandling11, Magdalena Sekowska11, So-Youn Shin11, Kerrin S. Small14, Nicole Soranzo11, Gabriela Surdulescu14, Mary E. Travers9, Loukia Tsaprouni11, Sophia Tsoka15, Alicja Wilk11, Tsun-Po Yang11, Krina T. Zondervan8

15Department of Informatics, School of Natural and Mathematical Sciences, King’s College London, Strand, London, UK; 16Department of Genetic Medicine and Development, University of Geneva Medical School, Geneva, Switzerland; 17University of Cambridge, Cambridge, UK; 18European Bioinformatics Institute, Hinxton, UK; 19University of Cambridge Metabolic Research Labs, Institute of Metabolic Science Addenbrooke’s Hospital Cambridge, UK; 20Cambridge NIHR Biomedical Research Centre, Addenbrooke’s Hospital, Cambridge, UK; 21Oxford NIHR Biomedical Research Centre, Churchill Hospital, Oxford, UK; 22St. John's Institute of Dermatology, King's College London, London, UK


2.10 Main Tables and Figures

2.10.1 Tables

Table 1. Composition of Met V1 and Met V2 Panels

| Panel Components | Met V1 | Met V2 |
|---|---|---|
| AT-hypomethylated footprints CpGs (N) | 1,089,355 | 2,683,904 |
| AT-regulatory elements (H3K4me1 and me3) CpGs (N) | 1,625,328 | 1,625,328 |
| Illumina 450K CpGs (N) | 210,883 | 482,421 |
| Metabolic trait-associated SNPs (N) | -- | 28,947 |
| Core SNPs (N) | -- | 256,327 |
| Total covered regions (Mb) | 87.0 | 156.2 |
| Total covered CpGs (N) | 2,496,975 | 4,442,383 |
| Total covered SNPs (N) | 1,343,928 | 2,840,815 |


2.10.2 Figures

Figure 1. Technical replication of MCC-Seq methylation calls and comparison with WGBS

Correlation between technical replicates from a DNA sample derived from visceral adipose tissue (VAT) sequenced from independent captures (a) of the same MCC-Seq sequence panel (Met V1) (4-plex and 10-plex; N=1,587,026 CpGs; R=0.98) and (b) of two different MCC-Seq sequence panels (Met V1 (4-plex) and Met V2 (6-plex); N=1,569,170 CpGs; R=0.97). (c) Comparison between MCC-Seq (4-plex) and WGBS (N=1,620,874 CpGs; R=0.97) methylation calls for the same VAT DNA sample. Only CpGs with sequence coverage ≥5X in MCC-Seq and WGBS experiments were included in the analysis; R is the Pearson correlation coefficient.
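The comparison summarised in this legend can be sketched as a Pearson correlation restricted to CpGs that pass the coverage filter in both experiments. The sketch below is illustrative only: the data layout (dictionaries mapping a CpG site to its methylation fraction and read coverage) and the function name are assumptions, not the analysis code used for the figure.

```python
import math
from typing import Dict, Tuple

# CpG site (chromosome, position) -> (methylation fraction, read coverage)
Calls = Dict[Tuple[str, int], Tuple[float, int]]


def pearson_at_shared_sites(a: Calls, b: Calls, min_cov: int = 5) -> Tuple[int, float]:
    """Return (N, R) over CpGs present in both sets with coverage >= min_cov in each."""
    xs, ys = [], []
    for site, (meth_a, cov_a) in a.items():
        if site in b and cov_a >= min_cov and b[site][1] >= min_cov:
            xs.append(meth_a)
            ys.append(b[site][0])
    n = len(xs)
    if n < 2:
        raise ValueError("need at least two shared CpGs passing the coverage filter")
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    if sxx == 0.0 or syy == 0.0:
        raise ValueError("methylation values are constant in one data set")
    return n, sxy / math.sqrt(sxx * syy)
```

Reporting N alongside R matters here because the number of CpGs passing the filter differs between each pair of experiments.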


Figure 2. Comparison of methylation calls obtained with different methods

(a) Correlation between MCC-Seq (4-plex) and Illumina 450K array methylation calls for the same VAT DNA sample (R=0.96), (b) comparison between WGBS and Illumina 450K array results (R=0.96) and (c) comparison between WGBS and MCC-Seq (4-plex) results (R=0.97). Only CpGs with data available from all three techniques were included (N=150,898 CpGs); we required sequence coverage ≥5X for MCC-Seq and WGBS; R is the Pearson correlation coefficient.


Figure 3. Annotation of triglycerides (TG)-associated CpGs in putative regulatory regions

[Figure 3: bar charts of fold change (y-axis) for CpGs at permutation p≤0.001 and p≤0.0001. Panel (a): AT putative enhancer and AT unique putative enhancer. Panel (b): AT putative promoter and AT unique putative promoter.]

CpGs with average read coverage above the 20th percentile that showed evidence of association with TG (p≤0.001 or p≤0.0001) were annotated with additional data. (a) This panel shows significant enrichment (y-axis, fold-change) of TG-associated CpGs for p≤0.001 (orange bars) and p≤0.0001 (grey bars) within putative enhancer regions as defined by H3K4me1 marks and/or LMRs (*** denotes p=5.3×10^-7 for p≤0.001 and p=4.9×10^-5 for p≤0.0001 CpGs, respectively), and for H3K4me1 marks and/or LMRs specific to AT (*** denotes p=6.0×10^-10 for p≤0.001 and p=4.1×10^-7 for p≤0.0001 CpGs, respectively). (b) This panel shows significant depletion (y-axis, fold-change) of the same TG-associated CpGs (p≤0.001 shown as orange bars and p≤0.0001 shown as grey bars) within putative promoter regions as demarcated by H3K4me3 marks and/or UMRs (*** denotes p=7.1×10^-10 for CpGs with p≤0.001 and * p=0.023 for CpGs with p≤0.0001) but enrichment when restricting to H3K4me3 marks and/or UMRs specific to AT (** denotes p=2.4×10^-3 for CpGs with p≤0.001 and * p=0.020 for CpGs with p≤0.0001). Enrichment was established using Fisher’s exact test.
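The fold change and Fisher's exact test reported in this legend can be illustrated with a short Python sketch. The 2x2 layout assumed here (TG-associated versus background CpGs, inside versus outside the annotated regions) is a guess at how the counts were arranged, and the function names are hypothetical. The exact enumeration is practical for modest counts only; dedicated statistical software would normally be used for genome-scale tables.

```python
from math import comb


def fisher_exact_two_sided(a: int, b: int, c: int, d: int) -> float:
    """Two-sided Fisher's exact P-value for the 2x2 table [[a, b], [c, d]].

    Assumed layout: a = associated CpGs in region, b = associated CpGs outside,
    c = background CpGs in region, d = background CpGs outside.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    denom = comb(n, col1)

    def prob(k: int) -> float:
        # Hypergeometric probability of k associated CpGs falling in the region.
        return comb(row1, k) * comb(n - row1, col1 - k) / denom

    k_min = max(0, col1 - (n - row1))
    k_max = min(row1, col1)
    p_obs = prob(a)
    # Sum all tables that are as probable as, or less probable than, the observed one.
    total = sum(p for p in (prob(k) for k in range(k_min, k_max + 1))
                if p <= p_obs * (1.0 + 1e-7))
    return min(1.0, total)


def fold_enrichment(a: int, b: int, c: int, d: int) -> float:
    """In-region fraction among associated CpGs relative to background CpGs."""
    return (a / (a + b)) / (c / (c + d))
```

A fold enrichment above 1 corresponds to the enrichment bars in the figure, and a value below 1 to depletion.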


Figure 4. Top TG-associated CpG mapping to an AT-specific regulatory region – CD36

[Figure 4: genome browser view (hg19, chr7:80,240,000-80,300,000; scale bar 20 kb) with the following tracks. Discovery cohort: HDL-chr7:80,276,086 (p=2.9E-05) and TG-chr7:80,276,086 (p=1.1E-09). Replication cohort: HDL-cg05917188 (p=1.8E-03) and TG-cg05917188 (p=3.2E-07). AT meth-exp association: ILMN_1665132 cg05917188 (p=2.4E-05). AT hypomethylated footprints (LMRs and UMR). RefSeq Genes (CD36). Adipocyte RNA-Seq (forward; scale 0 to 7000).]

The top TG-associated CpG (chr7:80,276,086-80,276,087; p=1.1×10^-9; generalized linear model assuming a binomial distribution; turquoise track) identified in the discovery cohort maps within an intragenic region of CD36, which overlaps an AT-specific LMR (black track). Investigation within a population-based cohort (N~650) replicated the epigenetic effect in a nearby 450K array probe (orange track) mapping to the same regulatory region (cg05917188; p=3.2×10^-7; linear mixed model). The methylation status of the latter probe was also found to be negatively associated with CD36 expression in AT (ILMN_1665132; p=6.7×10^-5; linear mixed model, pink track). AT-specific expression of the gene was also noted through AT RNA-Seq data (purple track).


2.11 Supplementary Materials

2.11.1 Supplementary Tables

Supplementary Table 1. Sequence statistics of the MetV1 pooled samples

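| | Plexing level | Sample ID | Total raw reads | Total aligned reads (%) | Total on-target reads (%) | Total on-target aligned reads (%) | Total on-target aligned reads | Average coverage on-target CpG sites | Total captured on-target CpG sites (%) | CpG sites with >=5 reads (%) | CpG sites with >=10 reads (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Individual | 1 | 1 | 257,885,682 | 60 | 67 | 40 | 95,832,520 | 82 | 99 | 94 | 94 |
| Individual | 2 | 1 | 151,069,388 | 60 | 71 | 43 | 58,871,218 | 53 | 99 | 91 | 90 |
| Individual | 2 | 2 | 144,059,126 | 62 | 69 | 43 | 57,386,604 | 50 | 99 | 90 | 89 |
| Individual | 4 | 3 | 65,799,220 | 70 | 79 | 55 | 35,355,720 | 33 | 99 | 82 | 80 |
| Individual | 4 | 4 | 62,929,502 | 72 | 80 | 57 | 34,630,762 | 33 | 99 | 82 | 79 |
| Individual | 4 | 5 | 57,375,348 | 72 | 79 | 57 | 31,638,544 | 30 | 99 | 80 | 77 |
| Individual | 4 | 6 | 51,172,122 | 73 | 78 | 57 | 28,258,060 | 27 | 99 | 80 | 76 |
| Individual | 6 | 1 | 65,153,260 | 74 | 55 | 41 | 26,090,152 | 24 | 99 | 80 | 76 |
| Individual | 6 | 2 | 51,084,062 | 76 | 56 | 43 | 21,392,302 | 21 | 99 | 77 | 71 |
| Individual | 6 | 7 | 59,331,638 | 73 | 55 | 40 | 23,132,120 | 21 | 99 | 77 | 71 |
| Individual | 6 | 8 | 72,222,874 | 74 | 55 | 41 | 29,161,380 | 27 | 99 | 82 | 78 |
| Individual | 6 | 9 | 65,481,242 | 73 | 54 | 39 | 25,251,428 | 24 | 99 | 80 | 75 |
| Individual | 6 | 10 | 71,864,932 | 72 | 54 | 39 | 27,650,522 | 25 | 99 | 80 | 76 |
| Individual | 10 | 11 | 36,474,056 | 75 | 59 | 44 | 15,739,422 | 14 | 98 | 65 | 54 |
| Individual | 10 | 12 | 32,548,084 | 76 | 59 | 45 | 14,374,870 | 13 | 98 | 63 | 51 |
| Individual | 10 | 13 | 32,058,534 | 76 | 59 | 45 | 14,139,984 | 13 | 98 | 62 | 49 |
| Individual | 10 | 14 | 29,780,130 | 75 | 60 | 45 | 13,047,378 | 12 | 98 | 60 | 46 |
| Individual | 10 | 15 | 38,922,258 | 74 | 58 | 43 | 16,445,576 | 15 | 98 | 67 | 56 |
| Individual | 10 | 16 | 34,623,254 | 74 | 57 | 42 | 14,442,798 | 13 | 98 | 62 | 50 |
| Individual | 10 | 17 | 40,850,638 | 75 | 51 | 38 | 15,457,642 | 14 | 98 | 66 | 55 |
| Individual | 10 | 18 | 28,196,414 | 74 | 58 | 43 | 11,877,752 | 10 | 98 | 55 | 38 |
| Individual | 10 | 19 | 31,954,644 | 76 | 60 | 46 | 14,209,644 | 13 | 98 | 63 | 50 |
| Individual | 10 | 20 | 32,427,388 | 75 | 59 | 44 | 14,171,532 | 13 | 98 | 62 | 49 |
| Average | 1 | NA | 257,885,682 | 60 | 67 | 40 | 95,832,520 | 82 | 99 | 94 | 94 |
| Average | 2 | NA | 147,564,257 | 61 | 70 | 43 | 58,128,911 | 51 | 99 | 90 | 90 |
| Average | 4 | NA | 59,319,048 | 72 | 79 | 57 | 31,194,648 | 31 | 99 | 81 | 78 |
| Average | 6 | NA | 64,189,668 | 74 | 55 | 41 | 25,446,317 | 24 | 99 | 79 | 75 |
| Average | 10 | NA | 33,783,540 | 75 | 58 | 44 | 14,390,660 | 13 | 98 | 63 | 50 |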


Supplementary Table 2. Sequence statistics of the Met V2 pooled samples

| | Plexing level | Sample ID | Total raw reads | Total aligned reads (%) | Total on-target reads (%) | Total on-target aligned reads (%) | Total on-target aligned reads | Average coverage on-target CpG sites | Total captured on-target CpG sites (%) | CpG sites with >=5 reads (%) | CpG sites with >=10 reads (%) |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Individual | 6 | 21 | 52,965,668 | 71 | 62 | 44 | 23,013,966 | 13 | 98 | 62 | 50 |
| Individual | 6 | 22 | 52,005,852 | 72 | 62 | 44 | 22,681,612 | 13 | 98 | 62 | 50 |
| Individual | 6 | 23 | 76,717,544 | 73 | 62 | 45 | 34,152,304 | 20 | 98 | 73 | 67 |
| Individual | 6 | 24 | 59,006,738 | 73 | 61 | 45 | 26,096,106 | 15 | 98 | 65 | 54 |
| Individual | 6 | 25 | 53,915,196 | 71 | 62 | 44 | 23,447,444 | 14 | 98 | 63 | 51 |
| Individual | 6 | 26 | 60,990,908 | 72 | 63 | 45 | 27,036,892 | 16 | 98 | 67 | 58 |
| Average | 6 | NA | 59,266,984 | 72 | 62 | 45 | 26,071,387 | 15 | 98 | 65 | 55 |


Supplementary Table 3. Comparison of MCC-Seq methylation calls with Illumina 450K array and WGBS data at various read depths

Pearson correlation by minimum read depth:

| Technique comparisons | >=5X (N=150,898 CpGs) | >=10X (N=144,868 CpGs) | >=20X (N=90,547 CpGs) | >=30X (N=17,852 CpGs) |
|---|---|---|---|---|
| 450K~MCC-Seq | 0.964 | 0.964 | 0.965 | 0.962 |
| 450K~WGBS | 0.962 | 0.962 | 0.961 | 0.959 |
| MCC-Seq~WGBS | 0.974 | 0.974 | 0.974 | 0.972 |


Supplementary Table 4. Comparison of MCC-Seq methylation calls with Illumina 450K array and WGBS data at various read depths excluding completely hypo and hypermethylated CpGs

Pearson correlation by minimum read depth:

| Technique comparisons | >=5X (N=45,097 CpGs) | >=10X (N=44,414 CpGs) | >=20X (N=30,934 CpGs) | >=30X (N=7,182 CpGs) |
|---|---|---|---|---|
| 450K~MCC-Seq | 0.946 | 0.947 | 0.952 | 0.953 |
| 450K~WGBS | 0.942 | 0.943 | 0.946 | 0.947 |
| MCC-Seq~WGBS | 0.949 | 0.950 | 0.955 | 0.958 |


Supplementary Table 5. MuTHER replication and cis-mQTL regulation of top TG-associated CpGs

Supplementary Table 5 can be accessed online from the open access publication Allum et al.117 in Nature Communications using the following link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544751/


Supplementary Table 6. ATAC-Seq PCR amplification primers

| ID | Adaptor Sequence | Primer Sequence |
|---|---|---|
| Ad1 | n/a | AATGATACGGCGACCACCGAGATCTACACTCGTCGGCAGCGTCAGATGTG |
| Ad2.1 | TAAGGCGA | CAAGCAGAAGACGGCATACGAGATTCGCCTTAGTCTCGTGGGCTCGGAGATGT |
| Ad2.2 | CGTACTAG | CAAGCAGAAGACGGCATACGAGATCTAGTACGGTCTCGTGGGCTCGGAGATGT |
| Ad2.3 | AGGCAGAA | CAAGCAGAAGACGGCATACGAGATTTCTGCCTGTCTCGTGGGCTCGGAGATGT |
| Ad2.4 | TCCTGAGC | CAAGCAGAAGACGGCATACGAGATGCTCAGGAGTCTCGTGGGCTCGGAGATGT |
| Ad2.5 | GGACTCCT | CAAGCAGAAGACGGCATACGAGATAGGAGTCCGTCTCGTGGGCTCGGAGATGT |
| Ad2.6 | TAGGCATG | CAAGCAGAAGACGGCATACGAGATCATGCCTAGTCTCGTGGGCTCGGAGATGT |
| Ad2.7 | CTCTCTAC | CAAGCAGAAGACGGCATACGAGATGTAGAGAGGTCTCGTGGGCTCGGAGATGT |
| Ad2.8 | CAGAGAGG | CAAGCAGAAGACGGCATACGAGATCCTCTCTGGTCTCGTGGGCTCGGAGATGT |
| Ad2.9 | GCTACGCT | CAAGCAGAAGACGGCATACGAGATAGCGTAGCGTCTCGTGGGCTCGGAGATGT |
| Ad2.10 | CGAGGCTG | CAAGCAGAAGACGGCATACGAGATCAGCCTCGGTCTCGTGGGCTCGGAGATGT |
| Ad2.11 | AAGAGGCA | CAAGCAGAAGACGGCATACGAGATTGCCTCTTGTCTCGTGGGCTCGGAGATGT |
| Ad2.12 | GTAGAGGA | CAAGCAGAAGACGGCATACGAGATTCCTCTACGTCTCGTGGGCTCGGAGATGT |
| Ad2.13 | GTCGTGAT | CAAGCAGAAGACGGCATACGAGATATCACGACGTCTCGTGGGCTCGGAGATGT |
| Ad2.14 | ACCACTGT | CAAGCAGAAGACGGCATACGAGATACAGTGGTGTCTCGTGGGCTCGGAGATGT |
| Ad2.15 | TGGATCTG | CAAGCAGAAGACGGCATACGAGATCAGATCCAGTCTCGTGGGCTCGGAGATGT |
| Ad2.16 | CCGTTTGT | CAAGCAGAAGACGGCATACGAGATACAAACGGGTCTCGTGGGCTCGGAGATGT |
| Ad2.17 | TGCTGGGT | CAAGCAGAAGACGGCATACGAGATACCCAGCAGTCTCGTGGGCTCGGAGATGT |
| Ad2.18 | GAGGGGTT | CAAGCAGAAGACGGCATACGAGATAACCCCTCGTCTCGTGGGCTCGGAGATGT |
| Ad2.19 | AGGTTGGG | CAAGCAGAAGACGGCATACGAGATCCCAACCTGTCTCGTGGGCTCGGAGATGT |
| Ad2.20 | GTGTGGTG | CAAGCAGAAGACGGCATACGAGATCACCACACGTCTCGTGGGCTCGGAGATGT |
| Ad2.21 | TGGGTTTC | CAAGCAGAAGACGGCATACGAGATGAAACCCAGTCTCGTGGGCTCGGAGATGT |
| Ad2.22 | TGGTCACA | CAAGCAGAAGACGGCATACGAGATTGTGACCAGTCTCGTGGGCTCGGAGATGT |
| Ad2.23 | TTGACCCT | CAAGCAGAAGACGGCATACGAGATAGGGTCAAGTCTCGTGGGCTCGGAGATGT |
| Ad2.24 | CCACTCCT | CAAGCAGAAGACGGCATACGAGATAGGAGTGGGTCTCGTGGGCTCGGAGATGT |


Supplementary Table 7. ATAC-Seq Q-PCR primers

| ID | Sequence |
|---|---|
| Mit_+ve_F | CTAAATAGCCCACACGTTCCC |
| Mit_+ve_R | AGAGCTCCCGTGAGTGGTTA |
| GAPDH_+ve_F | CTGTCCCTTCAGTAGCTGCC |
| GAPDH_+ve_R | GAAGAGAGTGGGTTGGTGGG |
| GAPDH_-ve_F | TCTGGATGGCCTGAAGGAGA |
| GAPDH_-ve_R | GCCAGCAGCACTCATGTTTC |
| ACTB_+ve_F | GAGTCCTTAGGCCGCCAG |
| ACTB_+ve_R | TCCGACCAGTGTTTGCCTTT |
| ACTB_-ve_F | CATCTCGTGTCCAGTGCAGA |
| ACTB_-ve_R | CCATGCAATGTGGGAGTCCT |


2.11.2 Supplementary Figures

Supplementary Figure 1. Extended comparison of MCC-Seq methylation calls with WGBS and the Illumina 450K array excluding completely hypo and hypermethylated CpGs

(a) Correlation between MCC-Seq (4-plex) and Illumina 450K array methylation calls for the same VAT DNA sample (R=0.95), (b) comparison between WGBS and Illumina 450K array results (R=0.94) and (c) comparison between WGBS and MCC-Seq (4-plex) results (R=0.95). Only CpGs with data available from all three techniques were included, and completely hypo and hypermethylated CpGs were excluded (N=150,898 CpGs); we required sequence coverage ≥5X for MCC-Seq and WGBS; R is the Pearson correlation coefficient.


Supplementary Figure 2. Correlation between Illumina 450K array and MCC-Seq methylation calls at different read coverage

Pearson correlation between Illumina 450K array and MCC-Seq methylation calls for the same VAT DNA sample is shown at different read depths (fold, x-axis).


Supplementary Figure 3. Comparison of MCC-Seq methylation calls with Agilent SureSelect

Correlations between MCC-Seq (Mimic) and Agilent SureSelect methylation calls for the LCL GM12878 cell line (a) using all overlapping CpGs (N=2,551,186; R=0.99) and (b) excluding hypo (0%) and hypermethylated CpGs (N=1,734,371; R=0.99). Only CpGs with data available from both techniques were included; we required sequence coverage ≥5X for MCC-Seq and WGBS accordingly; R is the Pearson correlation coefficient; density scales are independent in the two panels.


Supplementary Figure 4. Distributions of sequence coverage at included CpG sites

[Panels (a) and (b): histograms of the number of CpGs (y-axis) by average read depth (x-axis, 20 to 100).]

Distributions of the average read depth across the 72 individuals of included CpG sites when considering (a) all CpG sites with ≥5X and ≤ 100X sequence depth and (b) sites meeting these criteria and with average sequence depth above the 20th percentile.


Supplementary Figure 5. Outline of the trait association and population-based validation studies

In the left panel, an outline of the subsets used for the different analyses in the association study of methylation with triglyceride levels, as well as the main results, is presented. In the right panel, the outline of the population-based validation studies comparing methylation values obtained with the 450K array and MCC-Seq in 24 individuals is shown.


Supplementary Figure 6. Average methylation pattern of CpGs captured with MCC-Seq Met V1 design

The figure shows the average methylation values (%, x-axis) for 72 visceral adipose tissue samples at on-target CpG sites above the 20th percentile average reads coverage (N=1,710,209) from Met V1 capture experiments.


Supplementary Figure 7. Characterization of adipose hypomethylated footprints

[Panel (a): median methylation (%, y-axis) against log2 number of CpGs in segment (x-axis). Panel (b): median methylation (%, y-axis) for LMRs and UMRs.]

Hypomethylated footprints were generated from WGBS on AT using the R/Bioconductor package MethylSeekR. (a) Unmethylated regions (UMRs, right of dotted line) and low-methylated regions (LMRs, left of dotted line) were differentiated by a 30 CpG content threshold. (b) LMRs (N=45,065) and UMRs (N=20,195) show different median methylation patterns with LMRs having a broader range of methylation and UMRs being less variable in their methylation status and associated with low-methylated promoter regions. Boxplot whiskers represent 1.5*IQR (interquartile range).


Supplementary Figure 8. Variability of enhancer and promoter associated CpG sites

[Panels (a) and (b): histograms of the number of CpGs (y-axis) by standard deviation of methylation (x-axis).]

Interrogated CpG sites were mapped to putative enhancers (H3K4me1 or LMR) or promoters (H3K4me3 and UMR). Assessing the standard deviation of the methylation status across individuals (a) CpGs mapping to putative enhancers were found to be more variable (median SD=9.4) than (b) those mapping to putative promoters (median SD=1.5).


Supplementary Figure 9. CpG-by-CpG correlation between Illumina 450K array and MCC-Seq methylation calls in 24 samples

[Histogram of the number of CpGs (y-axis) by Pearson Correlation (R) (x-axis, −0.5 to 1.0).]

The figure shows the distribution of Spearman rank correlations for CpGs from the Illumina 450K array that had average read coverage above the 20th percentile in the MCC-Seq method (teal; N=138,067 CpGs; mean R=0.20). The correlations were generally greater when we included only the top 25% variable CpGs (orange; N=34,517 CpGs; mean R=0.50) or the top 10% variable CpGs (blue; N=13,807 CpGs; mean R=0.58) from the MCC-Seq methylation calls.


Supplementary Figure 10. Comparison of the observed heterozygosity from MCC-Seq and HumanOmni BeadChip array genotyping calls

Comparison between the observed heterozygosity based on the HumanOmni BeadChip array (SNP ratio, y-axis) and MCC-Seq (4-plex) (SNP ratio, x-axis) genotyping calls across ≥20 individuals (N=3,093 SNPs; R=0.99); we required sequence coverage ≥5X for MCC-Seq; R is the Pearson correlation coefficient.


Supplementary Figure 11. Distribution of triglycerides levels in the discovery cohort

[Histogram of the number of samples (y-axis) by triglycerides (mM, x-axis, 0.5 to 3.5).]

Distributions of the triglyceride levels (mM, x-axis) across the discovery cohort (N=70, two outliers excluded).


2.11.3 Supplementary Data

All Supplementary Data (Data 1-7) can be accessed online from the open access publication Allum et al.117 in Nature Communications using the following link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4544751/


Chapter 3: Dissecting Features of Epigenetic Variants Underlying Cardiometabolic Risk

3.1 Bridging Statement between Chapter 2 and 3

We previously validated MCC-Seq as an alternative but more cost-effective genome-wide investigation tool for simultaneous profiling of methylation status and genotypes over target regions. We support previous findings that metabolic trait-linked epigenetic variants are enriched in distal regulatory regions and, even further, in tissue-specific regions of disease-relevant tissues such as visceral adipose tissue.

Most EWAS to date have been completed in whole blood cohorts, and so comparisons of signals replicating from a biologically relevant tissue to a bioavailable tissue are needed. Expanding on our previous study, we applied MCC-Seq and performed an EWAS of circulating lipid traits in ~200 VAT samples with matched whole blood samples from our collaborators at the IUCPQ, enabling us to highlight key features of tissue-specific and tissue-shared complex trait-linked epigenetic signals. We investigated the results from our disease cohort in population-based cohorts through additional collaborations with the MuTHER (UK) and CARTaGENE (Québec, Canada) consortia. We further provided insight into the impact of genetic effects on methylation status at complex trait-linked regulatory regions and, in particular, focused on annotating lipid-linked GWAS from the large-scale efforts of the Global Lipids Genetics Consortium. We present a catalog of novel lipid-linked regulatory regions of interest for cardiometabolic risk investigations.


3.2 Title, Authors and Affiliations

Dissecting features of epigenetic variants underlying cardiometabolic risk using full-resolution epigenome profiling in regulatory elements

Fiona Allum1,2, Åsa K Hedman3, Xiaojian Shao1,2, Warren A Cheung1,2,#, Jinchu Vijay1,2, Frédéric Guénard4, Tony Kwan1,2, Marie-Michelle Simon1,2, Bing Ge1,2, Cristiano Moura5, Elodie Boulier1,2, Lars Rönnblom6, Sasha Bernatsky5, Mark Lathrop1,2, Mark I McCarthy7,8,9, Panos Deloukas10, André Tchernof11, Tomi Pastinen1,2,#, Marie-Claude Vohl4, Elin Grundberg1,2,#*

1Department of Human Genetics, McGill University, Montréal, Québec, H3A 0C7, Canada

2McGill University and Genome Quebec Innovation Centre, Montréal, Québec, H3A 0G1, Canada

3Cardiovascular Medicine unit, Department of Medicine Solna, Karolinska Institute, Stockholm, 171 76, Sweden

4Institute of Nutrition and Functional Foods (INAF), Université Laval, Québec, Québec, G1V 0A6, Canada

5Department of Epidemiology, McGill University, Montréal, Québec, H3A 1A2, Canada

6Department of Medical Sciences, Uppsala University, Uppsala, 751 85, Sweden

7Oxford Centre for Diabetes, Endocrinology and Metabolism, University of Oxford, Churchill Hospital, Old Road, Headington, Oxford, OX3 7LJ, United Kingdom

8Wellcome Centre for Human Genetics, University of Oxford, Roosevelt Drive, Oxford, OX3 7BN, United Kingdom

9Oxford NIHR Biomedical Research Centre, Oxford University Hospitals NHS Foundation Trust, John Radcliffe Hospital, Oxford, OX3 9DU, United Kingdom

10William Harvey Research Institute, Barts and The London School of Medicine and Dentistry, Queen Mary University of London, Charterhouse Square, London, EC1M 6BQ, United Kingdom

11Québec Heart and Lung Institute, Université Laval, Québec, Québec, G1V 0A6, Canada

#Children’s Mercy Hospitals and Clinics, Kansas City, Missouri, 64108, United States of America

# Current address

* Corresponding author

E-mail: [email protected] (EG)

Published in: Nature Communications 2019 March; 10(1): 1209. doi: 10.1038/s41467-019-09184-z.


3.3 Abstract

Sparse profiling of CpG methylation in blood by microarrays has identified epigenetic links to common diseases. We apply methylC-capture sequencing (MCC-Seq) in a clinical population of ~200 adipose tissue and matched blood samples (Ntotal~400), providing high-resolution methylation profiling (>1.3M CpGs) at regulatory elements. We link methylation to cardiometabolic risk through associations to circulating plasma lipid levels and identify lipid-associated CpGs with unique localization patterns in regulatory elements. We show distinct features of tissue-specific versus tissue-independent lipid-linked regulatory regions by contrasting with parallel assessments in ~800 independent adipose tissue and blood samples from the general population. We follow up on adipose-specific regulatory regions under (1) genetic and (2) epigenetic (environmental) regulation via integrational studies. Overall, the comprehensive sequencing of regulatory element methylomes reveals a rich landscape of functional variants linked genetically as well as epigenetically to plasma lipid traits.


3.4 Introduction

Complex diseases such as obesity and type 2 diabetes (T2D) are caused by the joint action of predisposing genetic and environmental factors118-121. Heritability measures of obesity-related traits such as BMI have shown that the genetic contribution is likely only ~30-40%17, pointing towards a larger impact of environmental effects than previously estimated.

CpG methylation has been shown to be disrupted in disease states77,78 and by environmental modifiers88,122. As such, assessment of CpG methylation changes through epigenome-wide association studies (EWAS) enables us to connect environment and genetics66,89 to phenotype and disease123. Circulating lipid profiles are clinically applied in cardiometabolic risk assessment121, providing indications of metabolic complications among healthy and obese individuals124. Although past EWAS efforts have successfully identified lipid-associated loci with roles in metabolic processes82-84,86,125, we have shown the importance of using disease-targeted tissues for functional interpretation of disease loci due to the preferential mapping of identified variants to tissue-specific regulatory elements37,66. This is an important observation considering that most EWAS to date have studied whole blood tissue using targeted arrays (e.g. Illumina 450K array), which underrepresent distal regulatory regions (e.g. enhancers) and bias towards promoter regions. In fact, promoters are largely uninformative in EWAS due to the invariable state of resident CpGs across individuals66, partly due to insufficient sensitivity measures in DNA methylation assessments.


To overcome this limitation, we implemented the methylC-capture sequencing (MCC-Seq) approach permitting simultaneous methylome and genotype profiling in regulatory regions at high resolution117. A pilot adipose tissue EWAS of triglyceride (TG) levels identified novel TG-linked methylation variation within enhancers. MCC-Seq was also applied across various tissues in hundreds of donors and demonstrated stronger enrichment of GWAS SNPs underlying allele-specific methylation within disease-linked tissues, emphasizing the importance of utilizing appropriate tissues to decipher not only epigenetic variants but also genetic variants126.

Here, we present a large next-generation sequencing (NGS)-based EWAS applying MCC-Seq on adipose tissue and blood samples derived from a clinically relevant cohort of obese individuals. We link ~1.3M dynamic CpGs to blood plasma lipids and map positional trends of lipid-linked CpGs within functional elements. We highlight the ability of MCC-Seq to fine-map EWAS signals through replication in the large MuTHER adipose cohort and apply integrative approaches to identify disease-associated epigenetic variants linked to regulatory effects, further providing insight into metabolic disease etiology. We further show features of the metabolic-disease-linked methylome by assessing the contribution of genetic factors and use these tabulated associations to fine-map cardiometabolic-risk-associated GWAS SNPs.


3.5 Results

3.5.1 Adipose tissue epigenetic variants linked to plasma lipids

CpG methylation was profiled in visceral adipose tissue (VAT) from 199 severely obese individuals (BMI>40 kg m^-2; 60% female) undergoing bariatric surgery (IUCPQ, Université Laval; Supplementary Table 1; Methods). We applied the MCC-Seq protocol querying up to 3.3M CpGs mapping to adipose tissue regulatory regions117 (Methods). We focus on a conservative set of highly covered (33X) and variable sites corresponding to 1.3M CpGs (Methods) that exhibited mainly (55%) hypomethylated states (<20% average methylation) with a smaller proportion (10%) being hypermethylated (>80% average methylation).
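The methylation-state thresholds quoted above can be illustrated with a minimal sketch. The function names and the example values are hypothetical and are not taken from the study pipeline; only the <20% and >80% cutoffs come from the text.

```python
def methylation_state(mean_methylation_pct):
    """Classify a CpG by its average methylation (in percent) across samples."""
    if mean_methylation_pct < 20:
        return "hypomethylated"
    if mean_methylation_pct > 80:
        return "hypermethylated"
    return "intermediate"


def state_proportions(mean_methylation_values):
    """Return the fraction of CpGs falling in each methylation state."""
    counts = {"hypomethylated": 0, "intermediate": 0, "hypermethylated": 0}
    for value in mean_methylation_values:
        counts[methylation_state(value)] += 1
    total = len(mean_methylation_values)
    return {state: n / total for state, n in counts.items()}


# Hypothetical average methylation values (percent) for five CpGs
example = [5.0, 12.5, 50.0, 19.9, 95.0]
proportions = state_proportions(example)
```

Applied to the full set of tested CpGs, a tally of this kind would yield the proportions of hypomethylated and hypermethylated sites reported in the text.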

We associated CpG methylation at the 1.3M sites in adipose tissue with circulating plasma lipid levels, i.e. triglycerides (TG), HDL-cholesterol (C), LDL-C and total cholesterol (TC) (Methods), applying a generalized linear model accounting for sequencing depth, age and BMI. Controlling for bias and inflation of our test statistics was achieved using the Bayesian method BACON127, noting an improvement in the inflation factor (lambda) after correction across all trait associations (Supplementary Figures 1-4). In total, methylation levels at 1,230 (FDR 10%; corrected p<3.52x10^-5) and 615 (FDR 5%; corrected p<9.25x10^-6) CpGs were associated to at least one lipid trait (Supplementary Figure 5). We subsequently refer to “lipid-CpGs” as those reaching significant lipid associations at FDR 10% (Supplementary Data 1). Overall, 13% of lipid-CpGs were linked to more than one lipid trait (Supplementary Figure 5). When we assessed the inter-individual variability of lipid-CpGs, these sites also showed a more variable state than the full set of 1.3M CpGs tested (Supplementary Figure 6).
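As a point of reference for the inflation factor and FDR thresholds mentioned above, the sketch below computes the conventional genomic inflation factor (lambda) and applies a Benjamini-Hochberg cutoff. Both are assumptions made for illustration: BACON itself estimates bias and inflation with a Bayesian mixture model that is not reproduced here, and the text does not state which FDR procedure was used. All input values are hypothetical.

```python
import statistics

CHI2_1DF_MEDIAN = 0.4549364  # median of a chi-square distribution with 1 d.f.


def inflation_factor(z_scores):
    """Conventional lambda: median observed chi-square over its expected median."""
    chi2 = [z * z for z in z_scores]
    return statistics.median(chi2) / CHI2_1DF_MEDIAN


def benjamini_hochberg(p_values, fdr=0.10):
    """Return one boolean per p-value: True if significant at the given FDR."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k whose p-value satisfies p_(k) <= (k / m) * fdr
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * fdr:
            cutoff_rank = rank
    significant = [False] * m
    for idx in order[:cutoff_rank]:
        significant[idx] = True
    return significant


# Hypothetical p-values for five CpGs, tested at FDR 10%
p_values = [0.9, 0.001, 0.2, 0.012, 0.02]
calls = benjamini_hochberg(p_values, fdr=0.10)
```

A lambda close to 1 indicates well-calibrated test statistics, which is the property the correction step is meant to restore before FDR thresholds are applied.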

3.5.2 Positioning of lipid-CpGs within regulatory elements

Identified lipid-CpGs were annotated using adipose tissue hypomethylated footprints - low-methylated regions (LMRs) and unmethylated regions (UMRs)117,128 - as indicators of regulatory elements. We previously characterized these hypomethylated footprints128, showing co-localization of adipose tissue LMRs and UMRs with the H3K4me1 active enhancer and H3K4me3 active promoter marks, respectively, from primary human adipocytes (NIH Roadmap Consortium). In all subsequent analyses, we refer to LMRs and UMRs as putative enhancers and promoters, respectively. We additionally characterized these adipose tissue regulatory regions in terms of their genomic lengths and discovery CpG densities, where we noted putative enhancers were shorter and less densely populated than promoters (Supplementary Table 2).

Consistent with our previous findings117, lipid-CpGs were enriched in putative adipose enhancers (26% of lipid-CpGs versus 17% in background; Fisher’s exact test throughout; Fisher’s p=6.6x10^-13) while being less likely to map to putative promoters (40% of lipid-CpGs versus 54% in background; Fisher’s p<2.2x10^-16; Supplementary Figure 7). The set of lipid-CpGs was then restricted to include only those mapping to adipose tissue regulatory regions not shared with other tissues (i.e. whole blood; Methods) and showed stronger enrichment patterns at enhancers (13% of lipid-CpGs versus 7% in background; Fisher’s p=9.9x10^-13). Additionally, we noted a reversal of trends as lipid-CpGs were enriched in adipose-specific promoters (10% of lipid-CpGs versus 6% in background; Fisher’s p=8.1x10^-11; Supplementary Figure 7). Of note, these localization patterns appear to be independent of CpG methylation variability at interrogated sites (Fisher’s p<1.1x10^-7; top 25th percentile; Supplementary Figure 7). In total, we identified 264 putative adipose enhancers (LMRs) and 303 promoters (UMRs) harboring lipid-CpGs, of which 341 are shared elements and 226 are adipose-specific elements. These 567 regulatory elements were carried forward for further analyses (Supplementary Data 1; Fig. 1).

Given the high-density coverage of CpG methylation obtained through MCC-Seq, we investigated differences in positional trends of lipid-CpGs within adipose tissue hypomethylated footprints (Methods). Focusing first on all discovery CpGs mapping to the 264 LMRs, lipid-CpGs were located more towards the mid-point of putative enhancers compared to all CpGs (Fig. 2A). CpGs located in UMRs (within +/-1.5kb of a transcription start site (TSS); 139/303 UMRs) exhibited a bimodal distribution flanking the TSS similar to the background with a slight peak shift downstream of the TSS further into the gene body (Fig. 2B). To rule out potential technical biases explaining these observations, we assessed the mean coverage of reads within these elements and found that although lipid-CpGs have higher coverage than all assessed CpGs, the coverage does not differ based on the position of lipid-CpGs within the elements per se (Supplementary Figure 8).

Next, we contrasted the capacity of MCC-Seq to capture lipid-CpGs within adipose tissue regulatory regions with that of alternative methods such as the Illumina 450K82-84,86,125 and EPIC arrays. As a whole, the EPIC and 450K arrays captured only 17% and 6% of the total CpGs profiled in LMRs by MCC-Seq and 29% and 19% of those mapping to UMRs, respectively. These percentages dropped further when focusing on CpGs typed on the array-based methods directly overlapping MCC-Seq CpGs (Supplementary Table 3). Positional trends of CpGs in both arrays showed a depletion of coverage within putative promoters downstream of the TSS (Supplementary Figure 9) - regions towards the gene body where we showed lipid-CpGs to be enriched (Fig. 2B).

3.5.3 Replication of lipid-linked adipose regulatory regions

We then validated the 567 adipose regulatory regions mapping with lipid-CpGs in the MuTHER cohort (N~650 individuals) where subcutaneous adipose tissue CpG methylation levels were profiled on the 450K array66 and associated to the same lipid traits under investigation (TG, HDL-C, LDL-C and TC)86. Of the 567 highlighted regulatory regions, only 365 (64%) were covered by the 450K array. In line with design biases of the 450K array, a higher proportion of adipose tissue promoter regions (269/303 UMRs; 89%) than enhancer regions (96/264 LMRs; 36%) contained at least one 450K array CpG. Using a Bonferroni cutoff (taking into account each trait individually and with same direction of effect), we found the highest replication rate for TG-UMRs where 17% (13/76) of the regions were also associated with TG in the validation cohort. All replicated regions (N=21) are presented in Supplementary Data 2.


To assess the potential of MCC-Seq to fine-map EWAS signals, we focused on the 16 of 21 replicated regulatory regions containing at least 2 discovery lipid-CpGs where one of these overlapped the top MuTHER lipid-CpG. Here, 15/16 (94%) elements harboured stronger lipid associations at discovery CpGs that did not directly overlap the top MuTHER lipid-CpG positions (Supplementary Data 2). We then investigated the localization of the “fine-mapped” discovery lipid-CpGs compared to their nearby MuTHER lipid-CpGs within the adipose tissue regulatory elements. All the “fine-mapping” discovery CpGs mapping to adipose tissue LMRs were located at the mid-point (+/-20% from mid-point), representing a slight increase in proportion over their paired MuTHER CpGs (2/3 CpGs). This pattern is similar to the observed positional mapping trends for the full set of lipid-CpGs at LMRs, which exhibited a mid-point shift compared to all CpGs assessed (Fig. 2A; Fig. 2C). Likewise, “fine-mapping” discovery CpGs mapping to adipose UMRs tended to locate in greater numbers (7/12; 58%) than their paired MuTHER CpGs (5/12 CpGs; 42%) within the bimodal positional peaks (+20% to +45% or -20% to -45% from mid-point) previously observed for lipid-CpGs at UMRs (Fig. 2B; Fig. 2D). Neither of these fine-mapping trends reached nominal significance, most likely owing to the small number of observations and the additional bimodal pull of the fine-mapping exhibited at the putative promoter regions.
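The positional windows used above (+/-20% from the mid-point, and +/-20% to +/-45% for the flanking peaks) can be sketched as follows. The scaling is an assumption: the offset is expressed as a percentage of the full region length, so that values range from -50% to +50%; the text does not spell out the exact normalization. The function names and the "edge" label are hypothetical. The coordinates are those of the SPPL2B enhancer and its TG-linked CpG reported in Section 3.5.7.

```python
def relative_position(cpg_pos, region_start, region_end):
    """Signed offset of a CpG from the region mid-point, in percent of region length."""
    length = region_end - region_start
    midpoint = region_start + length / 2
    return 100.0 * (cpg_pos - midpoint) / length


def position_bin(rel_pct):
    """Assign the positional windows quoted in the text."""
    distance = abs(rel_pct)
    if distance <= 20:
        return "mid-point"
    if distance <= 45:
        return "flank"
    return "edge"


# Enhancer chr19:2332094-2333076 and its TG-linked CpG at chr19:2332436
rel = relative_position(2332436, 2332094, 2333076)
label = position_bin(rel)
```

Under this scaling the CpG sits roughly 15% upstream of the mid-point and falls in the mid-point window, in line with its description in the text as lying near the mid-point of the enhancer.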


3.5.4 Functional annotation of lipid-CpGs

Replicated lipid-linked adipose hypomethylated regulatory regions were characterized by performing transcription factor binding site (TFBS) motif analyses (Methods). Focusing on replicated UMRs harboring lipid-CpGs (N=16 regions) and excluding LMRs due to their small number (N=5), TFBS linked to adipogenesis and/or obesity-related metabolic complications were enriched, with members of the STAT family129-131 (STAT5A132, STAT1 and STAT3133) being most significant, followed by NFIB134,135 and RUNX1136,137 (Supplementary Table 4). We further noted that STAT5A, STAT3 and NFIB showed higher levels of expression in adipose tissues over whole blood in the GTEx Consortium data (GTEx portal; November 2017; Supplementary Figures 10-12) with the strongest evidence for NFIB expression. We confirmed adipocyte-specific expression of NFIB through differential expression analyses of purified human adipocytes from both subcutaneous and visceral depots versus various blood cell types (>14.0-fold change; p<3.98x10^-236; Methods).

Next, replicated lipid-linked adipose tissue regulatory regions (N=5 LMRs; N=16 UMRs) were functionally annotated by incorporating matching adipose tissue gene expression data from the MuTHER cohort66 (Methods). As many as 16/21 (76%) lipid-associated regions showed significant association between the methylation status of one of their resident CpGs and the expression levels of at least one cis-located gene (FDR 10%; range 1-9 associated genes/region; within +/-1Mb; Supplementary Data 3) - representing a 1.9-fold change in effect over all testable regulatory regions (10,141/26,050 regions; Fisher’s p=0.00104). All 16 regulatory regions depicting associations to at least one gene also exhibited stronger effects on gene expression at non-adjacent genes - with an average absolute distance of ~522kb to their most correlated gene compared to ~33kb to the transcribed region of their most proximal gene. A greater proportion of these replicated lipid-associated regulatory regions (11/16 regions; 69%) correlated to the expression levels of more than one gene compared to the background (4,673/10,141 regions; 46%; Fisher’s p=0.08).
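The enrichment comparison above (16/21 replicated regions versus 10,141/26,050 testable regions) can be sketched with a one-sided hypergeometric tail, which corresponds to a one-sided Fisher's exact test. Two assumptions are made here: that the 21 replicated regions are a subset of the 26,050 testable regions, and that a one-sided test is used. The exact contingency table and sidedness behind the reported p-value are not given in the text, so this sketch is not expected to reproduce it digit for digit. The function names are hypothetical.

```python
from math import comb


def hypergeom_enrichment_p(k, n, K, N):
    """One-sided P(X >= k) when drawing n regions from N, of which K carry the feature."""
    total = comb(N, n)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, min(n, K) + 1))
    return tail / total


def fold_change(k, n, K, N):
    """Ratio of the observed proportion (k/n) to the background proportion (K/N)."""
    return (k / n) / (K / N)


# Counts quoted in the text: 16/21 replicated regions vs. 10,141/26,050 testable regions
p_value = hypergeom_enrichment_p(16, 21, 10141, 26050)
fc = fold_change(16, 21, 10141, 26050)
```

Using exact integer binomial coefficients avoids floating-point underflow for large backgrounds; the same two functions apply to any of the proportion comparisons reported in this section once the four counts are known.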

We assessed whether the genes (N=44) for which expression levels were associated with methylation status at replicated lipid-linked regions (N=16) were also independently linked to the same plasma lipid phenotypes (Methods). As many as 77% (30/39) of testable genes linked to 15 replicated lipid-associated regulatory regions showed additional association to the same lipid trait under investigation in the expected direction of effect (Supplementary Data 4).

Restricting to genes listed in the GWAS SNP catalog (N=20/30; accessed September 2018), we observed that 6/20 (30%) genes associating to lipid-linked regulatory regions also showed association to metabolic-related phenotypes, revealing an enrichment of obesity-linked traits compared to the full catalog (692/15,815 genes; 4%; 6.9-fold change; Fisher’s p=0.00016; Supplementary Data 4; Methods). Ingenuity pathway analysis (Methods) of all 30 highlighted genes showed Gαq Signaling as the most significantly associated function within this gene set (p=6.94x10^-5; Supplementary Table 5). Interestingly, two of the four genes mapped to this pathway were regulated by the same lipid-regulatory element, which we follow up in more detail below.


3.5.5 Tissue-specificity of lipid-linked regulatory regions

To gain insight into the potential tissue-specific nature of epigenetic signatures associated to disease, we interrogated whether lipid-linked signals mapping to regulatory regions are detectable across tissues within a study population by profiling CpG methylation in whole blood from a matching set of samples (N=206) from the obese IUCPQ cohort (Supplementary Table 1). We linked whole blood methylation status to the same circulating plasma lipid levels (Methods) and successfully typed 565 out of the 567 regulatory regions harbouring discovery adipose tissue lipid-CpGs in whole blood, of which 340 were shared and 225 adipose-specific (i.e. not shared with whole blood) elements (Methods). Globally at the same significance threshold (using a Bonferroni cutoff for each trait individually and with same direction of effect), lipid-associations at shared regulatory elements replicated at a significantly higher rate (46/340 replicated lipid-linked regions; 14%) than adipose-specific elements (12/225 replicated regions; 5%; Binomial test p=9.0x10^-9). Lipid-associations at shared putative promoters (i.e. UMRs) were more likely to replicate across tissues than at shared enhancer regions - with 35/221 (16%) lipid-linked UMRs compared to 11/119 (9%) LMRs replicating in whole blood. Specifically, we were able to validate associations at 4/39 (10%) TG-LMRs, 2/39 (5%) HDL-LMRs, 7/34 (21%) LDL-LMRs, 4/26 (15%) TC-LMRs, 10/64 (16%) TG-UMRs, 7/69 (10%) HDL-UMRs, 11/77 (14%) LDL-UMRs and 14/57 (25%) TC-UMRs in whole blood (Supplementary Data 5).

Previous studies have indicated the importance of accounting for differences in biological outcome of environmental and genetic effects on DNA methylation at the tissue level86; thus, we performed the replication across adipose tissue to whole blood by also allowing different directions of effect across tissues. Here, we were able to validate additional associations at 1/39 (3%) HDL-LMRs, 1/34 (3%) LDL-LMRs, 6/64 (9%) TG-UMRs, 4/69 (6%) HDL-UMRs, 10/77 (13%) LDL-UMRs and 3/57 (5%) TC-UMRs in whole blood (Supplementary Data 5). Taken together, we identified 68 adipose tissue regulatory regions (13 putative enhancers and 55 promoters) showing evidence for tissue-shared lipid-associations.

Pathway analysis of the 52 genes directly overlapping the 68 tissue-independent regulatory regions (Supplementary Table 6) revealed the adipogenesis pathway as the top significantly associated function (IPA p=3.1x10^-3; Methods). Among the genes highlighted within this pathway, we noted (1) the serine/threonine kinase AKT1 overlapping a shared promoter region (chr14:105260438-105262714) harboring CpGs positively correlated to both LDL-C and TC levels; (2) the histone deacetylase HDAC4 mapping with an intergenic enhancer region (chr2:240240338-240241584) containing CpGs depicting negative associations to HDL-C in adipose tissue that were reversed in whole blood; and (3) BMP4 overlapping a shared promoter region (chr14:54418956-54424030) where CpGs were negatively associated to TG levels. We further highlighted lipid-associated promoter regions at the following cardiometabolic risk-related loci: growth factor GDF7, kinase CERK, VGLL3 and ATP-binding cassette transporter ABCC5.

Next, we investigated how the lipid-linked and tissue-shared regulatory regions identified in a clinical population associate with the same traits independently of obesity status. CpG methylation was profiled by MCC-Seq in whole blood from a population-based (N=137) cohort (CARTaGENE; https://cartagene.qc.ca/; Supplementary Table 1), again linking whole blood methylation status to circulating plasma lipid levels (Methods). Overall, we found 22/68 (32%) regions to be associated with the same lipid trait under investigation in the population-based cohort (Supplementary Data 6). However, contrasting adipose lipid-associations that replicated in whole blood with the same (N=46 regions) versus opposing (N=28 regions) directions of effects (N=17/46 regions; 37% vs. N=5/28 regions; 18%) showed a marked difference in the replication rate (>2-fold change), indicating the possibility of the latter being more specific to the clinical condition.
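The ">2-fold" difference in replication rate quoted above follows directly from the reported counts, as the short arithmetic sketch below shows. The helper name is hypothetical; the counts are those given in the text.

```python
def replication_rate(replicated, tested):
    """Fraction of tested regions that replicated in the population-based cohort."""
    return replicated / tested


same_direction = replication_rate(17, 46)      # same direction of effect across tissues
opposite_direction = replication_rate(5, 28)   # opposing direction of effect
ratio = same_direction / opposite_direction    # fold difference between the two rates
```

Note that 46 + 28 exceeds the 68 tissue-shared regions, which is consistent with some regions being counted under both categories (for example, through different lipid traits).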

3.5.6 Genetic contribution to lipid-CpG methylation variability

We previously validated the ability and accuracy of MCC-Seq to provide genotyping information over target regions117, which we used here to study genetic effects on CpG methylation. Using this inferred genetic dataset, we integrated recently tabulated SNP-CpG associations (metQTL) in cis (+/-250kb) for a subset of the adipose discovery cohort126. First, we confirmed our previous findings66 that SNPs associated with CpG methylation are enriched in the vicinity of their linked CpGs (Supplementary Figure 13). Second, we investigated the level of genetic regulation among lipid-associated regulatory regions and noted a large fraction to be partly under genetic regulation.

In line with previous studies66,86,138, we observed that 64% (362/567) of lipid-associated elements depicted a significant SNP-CpG association (FDR 10%) compared to only 44% (22,101/50,759) in the background (Fisher’s p<2.2x10^-16; Table 1). We further found that this enrichment was maintained when accounting for overall methylation variability (top 25% variable CpG methylation status across all individuals; 194/406 lipid-linked regions versus 4,763/17,593 in background; Fisher’s p<2.2x10^-16).

We queried whether the identified lipid-linked regulatory regions have different levels of genetic contribution depending on their tissue-specificity and contrasted the elements unique to adipose (N=226) versus those shared across tissues to whole blood (N=341; Table 1). We observed an enrichment in association to cis-SNPs only at shared regulatory elements (N=251/341 regions; 74%; Fisher’s p<2.2x10^-16; Table 1). Restricting to the subset of 68 lipid-associated shared regulatory regions that were further validated in the matched whole blood cohort, we noted an increase in observed genetic variation contribution corresponding to as much as 93% (N=63/68 regions; Fisher’s p<2.2x10^-16; Supplementary Data 5; Table 1). Finally, we further filtered the list of lipid-linked regulatory regions to only contrast those that, in addition to being validated in the matched whole blood cohort, were also significantly associated to lipids in the independent population-based cohort (Supplementary Data 6). Here, we found a striking enrichment with 21/22 (95%) of these tissue-independent and obesity-status-independent regions to be under genetic regulation (Fisher’s p=3.3x10^-7).

To assess whether these genetically controlled lipid-linked epigenetic loci overlap GWAS loci, we incorporated GWAS SNPs for the same four lipid traits under study from the large-scale efforts of the Global Lipids Genetics Consortium2. We focused on lead SNPs associated with methylation of CpGs mapping to the 362 lipid-linked regulatory regions. Intersecting these SNPs and/or their proxies (r^2>0.8) with the fully released dataset of GWAS SNPs at nominal significance, we noted an enrichment at lipid-linked regulatory regions for all lipid traits: TG (3.7-fold; Fisher’s p=3.4x10^-16), HDL-C (4.4-fold; Fisher’s p<2.2x10^-16), LDL-C (4.3-fold; Fisher’s p<2.2x10^-16) and TC (4.1-fold; Fisher’s p<2.2x10^-16). Enrichment trends were maintained at a more stringent significance cutoff for GWAS SNPs (p=5.0x10^-8) albeit with lower statistical confidence due to smaller numbers (Fisher’s p<0.05).
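The intersection step described above, in which lead SNPs and their LD proxies (r^2>0.8) are matched against GWAS SNPs, can be sketched with plain set operations. Everything in the example is hypothetical: the SNP identifiers, the LD values and the function names are placeholders for illustration and do not come from the study data or from any particular LD tool.

```python
def expand_with_proxies(lead_snps, ld_r2, threshold=0.8):
    """Map each lead SNP to itself plus every SNP in LD above the r^2 threshold."""
    expanded = {}
    for lead in lead_snps:
        proxies = {snp for snp, r2 in ld_r2.get(lead, {}).items() if r2 > threshold}
        expanded[lead] = proxies | {lead}
    return expanded


def leads_overlapping_gwas(lead_snps, ld_r2, gwas_snps, threshold=0.8):
    """Return lead SNPs that are, or are in LD with, at least one GWAS SNP."""
    expanded = expand_with_proxies(lead_snps, ld_r2, threshold)
    return sorted(lead for lead, group in expanded.items() if group & gwas_snps)


# Hypothetical identifiers and LD values, for illustration only
leads = ["snpA", "snpB", "snpC"]
ld = {"snpA": {"snpX": 0.95}, "snpB": {"snpY": 0.5}}
gwas = {"snpX", "snpY", "snpC"}
hits = leads_overlapping_gwas(leads, ld, gwas)
```

Here snpA is retained through its proxy, snpC because it is itself a GWAS SNP, and snpB is dropped because its only partner falls below the LD threshold. Counting such hits for lipid-linked regions and for the background gives the proportions that enter the fold-enrichment comparison.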

3.5.7 Regulation of lipid-linked adipose-specific enhancers

Genetic regulation of lipid-linked regulatory elements is pronounced among regions shared across tissue to whole blood whereas adipose-specific regions exhibited a larger component of environmentally-driven regulation. Specifically, we found no evidence of genetic associations for 115/226 (51%) lipid-linked regulatory regions active in adipose but not whole blood. Among these lipid-linked regulatory regions with non-genetic regulatory effects, we followed up on an adipose-specific putative enhancer (chr19:2332094-2333076) harboring adipose lipid-CpGs linked to TG levels in our discovery and replication samples (Supplementary Data 2). This enhancer region maps to the first intragenic region of SPPL2B - a locus with no reported associations to cardiometabolic risk (Fig. 3A; Fig. 3B). We initially highlighted the region for harboring TG-linked methylation in the discovery cohort near the mid-point of the enhancer region (chr19:2332436; corrected p=2.4x10^-5; Fig. 3B) - mimicking positional trends for lipid-CpGs at this type of element. The positive correlation of methylation to TG levels at this region was validated in the large population-based MuTHER cohort at nearby CpGs (cg05660874; p=5.1x10^-10; cg10723746; p=1.0x10^-8; Supplementary Data 2; Fig. 3B). Confirming earlier results for the characterization of adipose putative enhancers128, overlapping the intragenic region with adipocyte-specific H3K4me1 and H3K4me3 (Roadmap; donor 92) showed co-localization of the highlighted adipose-specific LMR with the H3K4me1 enhancer mark (Fig. 3B). This was not observed in peripheral blood (Roadmap; donor TC015) ChIP-Seq data as H3K4me1 peaks were absent, indicating the adipose-specific nature of the regulatory marks (Fig. 3B). This observation corroborates the lack of replication of epigenetic regulation from whole blood EWAS at this element (Supplementary Data 5; Fig. 3B).

Integrating the MuTHER cohort expression data (Methods) revealed a lack of significant epigenetic association to expression levels of the SPPL2B locus. In line with the trend reported above, we instead noted that expression levels of GNA15 - located 803kb downstream of the putative enhancer region - were the most correlated (ILMN_1773963 vs. cg10723746; p=1.5x10^-17; ILMN_1773963 vs. cg05660874; p=1.5x10^-16; Fig. 3D). We further observed links to expression levels of GNG7 (179kb downstream), REEP6 (834kb upstream) and MKNK2 (281kb upstream; Supplementary Data 3; Fig. 3A). Expression levels of these four genes were also associated with TG levels in the MuTHER cohort, with GNA15 and GNG7 exhibiting the strongest relationships (GNA15; ILMN_1773963; p=1.5x10^-18; GNG7; ILMN_1728107; p=1.2x10^-12; Supplementary Data 4; Fig. 3C; Fig. 3D) - corroborating the link between regulation of these loci and levels of TG and the disease state. Supporting a co-regulation network between these genes is the strong correlation between the 450K array probes located at several regulatory regions at these genes and the expression products of GNG7 and GNA15 interchangeably (Supplementary Data 7; Fig. 3C; Fig. 3D). GNA15 and GNG7 both encode G-protein subunits with suggested roles for GNA15 in heart failure139 and glucose homeostasis140 and for GNG7 in coronary artery calcification141 (Supplementary Data 4). Both of these genes mapped to the top IPA disease-linked function of Gαq Signaling for genes under regulation by replicated lipid-linked regulatory regions (IPA p=6.94x10^-5; Supplementary Table 5). Taken together, this may suggest that the identified adipose-specific regulatory region has pleiotropic effects regulating both GNA15 and GNG7 expression, resulting in additive disease risk.

Although we observed a lack of enrichment for genetic associations among lipid-linked regulatory regions active in adipose but not whole blood, we identified 111 regions under genetic regulation. To exemplify this, we focused on an element mapping to an intragenic region of GALNT2 (chr1:230312462-230313455) showing both epigenetic and genetic associations to HDL-C (Fig. 4; Supplementary Data 1). Specifically, we showed that this lipid-linked regulatory region (corrected p=2.0x10^-5) is under tight genetic regulation with seven CpGs associating to multiple SNPs (N=21) flanking this element (Supplementary Data 8; Fig. 4B). These lead SNPs were in high LD (r^2>0.9) with an HDL-linked GWAS SNP2 (Global Lipids Consortium; rs627702; p=5.0x10^-24) located 11kb downstream of the enhancer (Fig. 4). Of note, this HDL-linked GWAS SNP was independent of the top GWAS SNP reported by the Global Lipids Consortium study for this same trait, which locates upstream of the enhancer region (rs4846914; p=4.0x10^-41; Fig. 4A)2. Genetic effects at this enhancer were supported by conditional analysis where absence of lipid-CpG association was noted when genotypes were included in the model with rs2760537 being the most prominent (corrected p=4.3x10^-2; q=0.78; Methods). Dissecting results with whole blood EWAS showed the adipose-specific nature of HDL-association at this region (Fig. 4). This enhancer is also not covered on the 450K array, representing a novel avenue for HDL-association to epigenetic variants. In addition, we found no evidence of cis-eQTLs (GTEx Consortium) linking genetic variants at this locus to gene expression (Fig. 4B). This observation in combination with the lack of a strong adipocyte-specific H3K27ac signature at this enhancer indicates a possible poised or primed region state, supporting efforts highlighting the superior molecular value provided by epigenetic traits over gene expression alone38. The glycosyltransferase GALNT2 locus itself has previously been associated to metabolic syndrome142, TG levels3,143 and type 2 diabetes144, with our current results supporting additional links to cardiometabolic disease through putative epigenetic regulation.

3.6 Discussion

We recently introduced MCC-Seq117 as a cost-effective and flexible platform for simultaneous DNA methylation and genotype interrogation in large-scale cohorts, permitting targeted and dense profiling of active methylomes within disease-relevant tissues. Here, we apply MCC-Seq in a comprehensive epigenome-wide study of plasma blood lipids (including TG, HDL-C, LDL-C and TC) and identify 567 lipid-linked regulatory regions in visceral adipose tissue. We combine a stringent statistical correction method (BACON) with a more lenient FDR threshold (10%) and perform detailed follow-ups on regions replicating across adipose tissue depots and across tissue types (whole blood) using the classical Bonferroni approach. This strategy allowed us to present an expanded resource of cardiometabolic risk-linked epigenetic loci.

We confirm current epigenomics trends where tissue-specific regulatory regions such as enhancers globally appear more likely to contain trait-linked CpGs compared to promoters, emphasizing the importance of targeting these regions to expand our understanding of complex disease biology. The observed underrepresentation of lipid-associated epigenetic variants within promoters may be attributable to the more tightly regulated and static nature of these elements, where smaller variations may be biologically impactful but harder to identify statistically. Using MCC-Seq for dense single-base resolution profiling at regulatory elements (7 CpGs/LMR and 37 CpGs/UMR, respectively) is advantageous in depicting unique positional trends of lipid-associated epigenetic variants. Key differences are observed in positioning at putative enhancers in contrast to promoters: lipid-associated epigenetic variants show clear enrichment at the mid-point of enhancers whereas they depict a bimodal distribution flanking the TSS of promoters. These observations may reflect the TFBS landscape within regulatory regions, with preferential binding of TFs at midpoints or edges depending on the element type. Comparisons of these full-resolution positional trends with those captured by array-based approaches exemplified the limitations of the latter methods to assess CpGs within regulatory regions, both in terms of the number of CpGs covered and ascertainment biases due to probe design.

We further demonstrate that NGS-based high resolution CpG profiling in epigenome-wide studies allows for fine-mapping of trait-linked epigenetic signals from large-scale array-based studies. Due to limitations in visceral adipose tissue cohort availabilities, the large MuTHER subcutaneous adipose tissue cohort was used as the best available proxy for replication studies; we therefore focused on epigenetic variants stable across these two adipose tissue depots. We highlight a high-confidence set of 21 adipose-specific regulatory regions associated with plasma lipid levels. Identified signals for >90% of lipid-associated regulatory regions were refined, with “fine-mapping” discovery CpGs mimicking positional trends highlighted at adipose regulatory elements. We hypothesize that differences in study design, both in terms of adipose tissue depots (visceral versus subcutaneous) and cohort selection (obese versus population-based) between the discovery study and MuTHER cohorts, respectively, may contribute to the observed replication rate. Nevertheless, our TFBS analysis provided insight into potential underlying signaling pathways. Specifically, binding motifs for NFIB were enriched at adipose tissue promoter regions mapping with lipid-associated epigenetic variants. Interestingly, NFIB has previously been reported to function in glucose transport134 and also serves as an important regulator of proper adipocyte differentiation, as exemplified by preferential mapping to adipocyte- or preadipocyte-specific open chromatin peaks134,135.

While MCC-Seq represents added fine-mapping value for full-resolution methylome assessment, past profiling efforts within the MuTHER cohort have provided us with rich array-based datasets. Linking methylation, expression and phenotype profiles across ~600 adipose tissue samples, we identify lipid-linked replicated adipose tissue regulatory regions associating to plasma lipid traits and expression levels at unique loci that associate to the same lipid traits. We highlight several obesity-related GWAS loci - CSK145, SLCO3A1146, GNG7141 and GNA15139,140 - and report several novel genes including LCN2, ECHS1, IDH2 and CD7. These genes also map to metabolic disease-linked pathways such as the highlighted Gαq Signaling, known to have a role in adipogenesis through its action in regulating intracellular calcium levels and downstream expression of the master regulators PPARγ and C/EBPα147,148.

CpG methylation is seen as a proxy linking genetics and environment to disease and phenotype. To further contribute to our understanding of genetic and non-genetic factors impacting complex diseases, we dissect lipid-linked regulatory regions through adipose SNP-CpG associations126 within the same cohort. As previously observed66,86,138, a large fraction of lipid-associated regulatory elements is under genetic regulation. These genetic effects are strengthened when restricting to tissue-independent and lipid-linked regions replicating to whole blood within the same disease-cohort (93%) as well as across cohorts (95%), hinting at a coordinated mechanistic regulation over these regions across tissues. We highlight an adipose tissue-specific putative enhancer on chromosome 1, located within the first intron of the obesity-linked GALNT23,142-144, where methylation levels at this regulatory region are under genetic control by variants within an HDL-linked GWAS locus2. Adipocyte-specific histone marks at the locus suggest that the HDL-linked regulatory region represents an adipose-specific poised enhancer and may explain why genetic regulation of this disease locus has not been identified by large eQTL efforts such as from the GTEx Consortium. This finding highlights the importance of studying epigenetic marks such as DNA methylation over gene expression alone.

Building on a previous study66, we also present an expanded methylation-expression association analysis, permitting us to assess pleiotropic effects of adipose tissue regulatory regions showing association to cardiometabolic risk factors. In line with current chromatin conformational studies, we report that the methylation status at lipid-linked regulatory regions shows stronger associations to the expression levels of genes located ~500kb away, on average. A majority (~70%) of these lipid-linked regulatory elements exhibit putative pleiotropic effects, indicating the occurrence of regulatory networks linked to the disease state. We focus on an adipose tissue-specific TG-linked enhancer region showing strong putative effects on the expression levels of two Gαq Signaling genes, glucose homeostasis-linked GNA15140 and coronary artery disease-linked GNG7141, located >200kb upstream of the element. We support observed TG-associations at this region through our three-way associations of methylation, expression and lipids within the MuTHER cohort. We also present evidence for a co-regulation network between regulatory regions mapping to GNA15 and GNG7 and the adipose enhancer of interest. Taken together, this may suggest that the identified adipose-specific regulatory region has pleiotropic effects regulating expression of both GNA15 and GNG7, resulting in additive disease risk.

In conclusion, our study demonstrates the advantage of NGS-based methylome profiling in disease-relevant tissues to identify complex trait-linked epigenetic variants at high resolution. We show that targeted sequencing approaches enable us to refine methylome landscape features and to further disentangle the genetic versus environmental contributions to complex traits. Our study represents an expanded dataset of cardiometabolic-risk-linked epigenetic regulatory regions in the disease-relevant adipose tissue. Our findings confirm that integrating cellular phenotypes with disease traits across tissues enables the identification of functional epigenetic variants in regulatory regions linked to complex disease traits.

3.7 Methods

3.7.1 Sample collections

We obtained 199 visceral adipose tissue (VAT) samples (males N=79; females N=120) from the Quebec Heart and Lung Institute for our discovery cohort (IUCPQ; Université Laval, Quebec City, Canada). Samples were collected between June 2000 and July 2012 for 1,906 severely obese (BMI >40 kg m-2) men (N=597) and women (N=1,309) undergoing biliopancreatic diversion with duodenal switch106 at this Institute as previously described107. Briefly, subjects fasted overnight before the surgical procedure. Anesthesia was induced by a short-acting barbiturate and maintained by fentanyl and a mixture of oxygen and nitrous oxide. VAT samples were obtained within 30 min of the beginning of the surgery from the greater omentum107.

We additionally obtained 206 whole blood samples from the same IUCPQ cohort described above for dissection of adipose epigenetic variants. Blood was collected before surgery.

The sample collection was approved by the Université Laval and McGill University (IRB FWA00004545) ethics committees and performed in accordance with the principles of the Declaration of Helsinki. Tissue banking and the severely obese cohort were approved by the research ethics committees of the Quebec Heart and Lung Institute. All participants provided written informed consent before enrolment in the study.

We included 137 whole blood samples from the CARTaGENE cohort (https://cartagene.qc.ca/) in the study design for dissection of adipose epigenetic variants. As a whole, the CARTaGENE cohort numbers ~20,000 general population subjects drawn from the province of Québec, Canada. Using bio-banked serum from a random subset (N=3,600) of the CARTaGENE cohort, ACPA (anti-citrullinated protein antibody) positive subjects (N=69; 18 with high titres, >60 units; the others with medium titres, 20-59 units) were identified by an enzyme-linked immunosorbent assay (Quanta Lyte, CCP3 IgG: Inova Diagnostics Inc., San Diego, CA). Age and sex-matched ACPA negative subjects (N=68) were randomly selected. ACPA status was not considered as a covariate in this study.

The methylation studies of the samples from CARTaGENE were approved by the McGill University institutional review board, IRB number A04-M46-12B. All participants provided written informed consent before enrolment in the study.

BMI was calculated as weight in kilograms divided by height in meters squared. Plasma total cholesterol (TC), triglyceride (TG) and high-density lipoprotein cholesterol (HDL-C) levels were measured using enzymatic assays. HDL-C was measured in the supernatant following precipitation of very low-density lipoproteins and low-density lipoproteins with dextran sulphate and magnesium chloride. Plasma low-density lipoprotein cholesterol (LDL-C) levels were estimated with the Friedewald formula. A summary of the cohort characteristics is tabulated in Supplementary Table 1.
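As a minimal illustration of the two derived measures above, the following Python sketch computes BMI and a Friedewald LDL-C estimate. The function names and the unit handling are assumptions made for illustration only: the text does not state the lipid units used, so both common forms of the Friedewald formula (TG/5 in mg/dL, TG/2.2 in mmol/L) are shown.

```python
def body_mass_index(weight_kg: float, height_m: float) -> float:
    """BMI = weight (kg) divided by height (m) squared."""
    return weight_kg / height_m ** 2


def friedewald_ldl_c(tc: float, hdl_c: float, tg: float,
                     units: str = "mmol/L") -> float:
    """Estimate LDL-C as TC - HDL-C - TG/k, where k depends on the units."""
    # Assumption: k = 5 for mg/dL and k = 2.2 for mmol/L (the usual forms
    # of the Friedewald formula); the source does not state the units.
    divisors = {"mg/dL": 5.0, "mmol/L": 2.2}
    if units not in divisors:
        raise ValueError(f"unsupported units: {units}")
    return tc - hdl_c - tg / divisors[units]


if __name__ == "__main__":
    print(body_mass_index(120.0, 1.70))
    print(friedewald_ldl_c(200.0, 50.0, 150.0, units="mg/dL"))
```

The Friedewald estimate is generally considered unreliable at very high triglyceride levels, so a real analysis would likely flag such samples rather than apply the formula blindly.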

3.7.2 MCC-Seq methylation profiling

Genomic DNA was extracted from the blood buffy coat using the GenElute Blood Genomic DNA kit (Sigma, St. Louis, MO, USA) and quantified using both NanoDrop Spectrophotometer (Thermo Scientific) and PicoGreen DNA methods. The samples were profiled through targeted methylation sequencing as previously described117.

Briefly, in MCC-Seq a whole-genome sequencing library is prepared, bisulfite converted and amplified, and a capture enriching for targeted bisulfite-converted DNA fragments is carried out. The captured DNA is further amplified and sequenced. More specifically, whole-genome sequencing libraries were generated from 700 to 1,000 ng of genomic DNA spiked with 0.1% (w/w) unmethylated λ DNA (Promega) previously fragmented to 300–400 bp peak sizes using the Covaris focused-ultrasonicator E210. Fragment size was controlled on a Bioanalyzer DNA 1000 Chip (Agilent) and the KAPA High Throughput Library Preparation Kit (KAPA Biosystems) was applied. End repair of the generated dsDNA with 3′- or 5′-overhangs, adenylation of 3′-ends, adaptor ligation and clean-up steps were carried out as per KAPA Biosystems' recommendations. The cleaned-up ligation product was then analysed on a Bioanalyzer High Sensitivity DNA Chip (Agilent) and quantified by PicoGreen (Life Technologies). Samples were then bisulfite converted using the Epitect Fast DNA Bisulfite Kit (Qiagen), according to the manufacturer's protocol. Bisulfite-converted DNA was quantified using OliGreen (Life Technologies) and, based on quantity, amplified by 9–12 cycles of PCR using the Kapa Hifi Uracil+ DNA polymerase (KAPA Biosystems), according to the manufacturer's protocol. The amplified libraries were purified using Ampure Beads, validated on Bioanalyzer High Sensitivity DNA Chips and quantified by PicoGreen. SeqCap Epi Enrichment System protocols (Roche NimbleGen) were then carried out for the capture step using the previously presented adipose-specific custom panels117 MetV1 (N=113 discovery adipose samples), MetV2 (N=92 discovery adipose samples; N=206 whole blood IUCPQ cohort samples) as well as a whole blood-specific custom panel126 (N=137 CARTaGENE cohort samples). The hybridization procedure of the amplified bisulfite-converted library was performed as described by the manufacturer, using 1 µg of total input of library, which was evenly divided among the libraries to be multiplexed, and incubated at 47 °C for 72 h. Washing and recovering of the captured library, as well as PCR amplification and final purification, were carried out as recommended by the manufacturer. Quality, concentration and size distribution of the captured library were determined by Bioanalyzer High Sensitivity DNA Chips. Captures were sequenced on the Illumina HiSeq2000/2500 system using 100-bp paired-end sequencing.

Reads were aligned to the bisulfite converted reference genome using BWA v.0.6.192. We removed (i) clonal reads, (ii) reads with low mapping quality score (<20), (iii) reads with more than 2% mismatch to converted reference over the alignment length, (iv) reads mapping on the forward and reverse strand of the bisulfite converted genome, (v) read pairs not mapped at the expected distance based on library insert size, and (vi) read pairs that mapped in the wrong direction as described by Johnson et al.105.

To avoid potential biases in downstream analyses, we applied our benchmark filtering criteria as follows: ≥5 total reads, no overlap with SNPs (dbSNP 137), ≤20% methylation difference between strands, no off-target reads and no overlap with DAC Blacklisted Regions (DBRs) or Duke Excluded Regions (DERs) generated by the ENCODE project (http://hgwdev.cse.ucsc.edu/cgi-bin/hgFileUi?db=hg19&g=wgEncodeMapability).

Methylation values at each site were calculated as total (forward and reverse) non-converted C-reads over total (forward and reverse) reads. CpGs were counted once per location combining both strands together. We restricted the analyses to CpGs covered in at least 100 individuals for the IUCPQ cohorts and 50 individuals for the CARTaGENE cohort (due to the smaller cohort size) with more than 10% of these having methylation status above zero and below 100%.
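The per-site methylation value and the coverage-based filters described above can be sketched in Python as follows. This is a simplified illustration under stated assumptions, not the pipeline used in the study: the `CpGCounts` record and the function names are hypothetical, only the read-depth, strand-difference and cohort-level criteria are shown (the SNP, off-target and blacklist filters are omitted), and the strand-difference check is assumed to apply only when both strands are covered.

```python
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class CpGCounts:
    """Hypothetical per-sample read counts at one CpG (both strands)."""
    fwd_meth: int   # non-converted C reads, forward strand
    fwd_total: int  # all reads, forward strand
    rev_meth: int   # non-converted C reads, reverse strand
    rev_total: int  # all reads, reverse strand


def methylation(site: CpGCounts) -> Optional[float]:
    """Non-converted C reads over total reads, strands combined."""
    total = site.fwd_total + site.rev_total
    if total == 0:
        return None
    return (site.fwd_meth + site.rev_meth) / total


def passes_site_filters(site: CpGCounts, min_reads: int = 5,
                        max_strand_diff: float = 0.20) -> bool:
    """Apply the >=5 total reads and <=20% strand-difference criteria."""
    if site.fwd_total + site.rev_total < min_reads:
        return False
    if site.fwd_total > 0 and site.rev_total > 0:
        diff = abs(site.fwd_meth / site.fwd_total
                   - site.rev_meth / site.rev_total)
        if diff > max_strand_diff:
            return False
    return True


def keep_cpg_in_cohort(values: Sequence[Optional[float]],
                       min_individuals: int = 100,
                       min_variable_fraction: float = 0.10) -> bool:
    """Keep a CpG covered in enough individuals, with more than 10% of
    them showing methylation strictly between 0 and 100%."""
    covered = [v for v in values if v is not None]
    if len(covered) < min_individuals:
        return False
    variable = sum(1 for v in covered if 0.0 < v < 1.0)
    return variable / len(covered) > min_variable_fraction
```

For the CARTaGENE cohort, the same cohort-level check would be called with `min_individuals=50`, following the text.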

3.7.3 Epigenome-wide association of plasma lipid levels

We tested associations between methylation levels of CpGs detected by MCC-Seq and circulating lipid levels (TG, HDL-C, LDL-C and TC) from the corresponding cohorts using a generalized linear model (GLM) function implemented in R v3.1.1. Outliers in lipid levels were identified by setting a cutoff of mean±3*SD and removed from further analysis. Lipid levels not depicting a normal distribution were converted to the log scale (adipose IUCPQ: TG; whole blood IUCPQ: TG and HDL-C; CARTaGENE: TG, HDL-C and LDL-C). The response variable (methylation levels) was fitted to a binomial distribution weighted for sequence read coverage at each site and adjusted (1) for age, sex, MCC-Seq panel batch effect and BMI for discovery cohort adipose samples, and (2) for age, sex, blood cell proportions and BMI for the whole blood IUCPQ and CARTaGENE cohort samples. We removed bias and inflation by applying the BACON correction127 on the test statistics using default parameters. False-discovery rate (FDR) was calculated with the R/Bioconductor q-value package149 for each trait individually in the adipose IUCPQ cohort. We set the significance level at FDR 10%. Bonferroni cutoff was used as a significance threshold for dissection with the whole blood cohorts for each trait individually.
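The phenotype pre-processing described above (removal of values outside mean±3*SD, then log conversion of non-normally distributed traits) can be sketched in Python as follows. This is an illustrative sketch only; the function names are hypothetical, and the use of the sample standard deviation and of the natural logarithm are assumptions, since the text specifies neither.

```python
import math
import statistics
from typing import List, Sequence


def remove_outliers(values: Sequence[float], n_sd: float = 3.0) -> List[float]:
    """Drop values outside mean +/- n_sd * SD.

    Assumption: SD is the sample standard deviation (n - 1 denominator).
    """
    mean = statistics.mean(values)
    sd = statistics.stdev(values)
    lower, upper = mean - n_sd * sd, mean + n_sd * sd
    return [v for v in values if lower <= v <= upper]


def log_transform(values: Sequence[float]) -> List[float]:
    """Convert strictly positive lipid levels to the log scale.

    Assumption: natural logarithm; the base is not stated in the text.
    """
    return [math.log(v) for v in values]


if __name__ == "__main__":
    # Hypothetical triglyceride values with one extreme measurement.
    tg = [1.0] * 20 + [100.0]
    cleaned = remove_outliers(tg)
    print(len(tg), len(cleaned))
    print(log_transform(cleaned)[:3])
```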

Subcutaneous adipose tissue methylation data from a population-based cohort of 648 female individuals in the TwinsUK/MuTHER cohort were obtained for replication. The samples were profiled on the Illumina 450K array and normalized as described previously66. Associations between 450K array methylation data (N=355,296 CpG probes) and the four circulating lipid levels under investigation were previously assessed86 using a linear mixed model taking into account familial relationship, twin zygosity and other cofactors (i.e. age, beadchip, BS conversion efficiency, BS-treated DNA input and BMI, except when assessing BMI itself). Bonferroni cutoff was used as a significance threshold for validation for each trait individually.

3.7.4 Positional mapping analyses

We defined un-methylated (UMR) and low-methylated regions (LMR) by mining through whole-genome bisulfite sequencing datasets from adipose and whole blood samples from the same cohort, separately, as described previously117,128. Through these efforts, we reported 20,195 UMRs and 45,065 LMRs for adipose tissue and 19,871 UMRs and 46,159 LMRs for whole blood samples117,128. Adipose-specific regions were previously defined by intersecting adipose and whole blood hypomethylated regions, where 2,342 and 24,687 adipose-specific UMRs and LMRs were tabulated, respectively117.

Positional trends of CpGs within adipose regulatory elements were assessed restricting to LMRs containing at least 1 CpG (N=31,964) and UMRs containing at least 1 CpG and within +/-1.5kb of transcription start sites (TSS) as well as not depicting bivalent gene transcription orientations (N=10,924). Positions of CpGs were tabulated as the percent distance from the midpoint of elements (genomic distance from midpoint (bp)/length of element (bp)*100) and collapsed to make density plots using ggplot2150 to summarize positional trends over all assessed elements. Gene orientation was additionally taken into account for CpGs mapping to UMRs where UMRs were positioned upstream of genes.
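The percent-distance calculation described above can be sketched in Python as follows. The function name is hypothetical, the coordinate convention (element length taken as end minus start) is an assumption, and flipping the sign for minus-strand genes is one plausible way of taking gene orientation into account rather than a confirmed detail of the original analysis.

```python
def percent_distance_from_midpoint(cpg_pos: int, start: int, end: int,
                                   strand: str = "+") -> float:
    """Signed distance of a CpG from the element midpoint, as a percentage
    of element length (range -50 to +50 for CpGs inside the element)."""
    length = end - start
    if length <= 0:
        raise ValueError("element end must be greater than start")
    midpoint = start + length / 2
    pct = (cpg_pos - midpoint) / length * 100
    # Assumption: for genes on the minus strand the axis is mirrored so
    # that the gene always lies on the same side of the element.
    return -pct if strand == "-" else pct


if __name__ == "__main__":
    # Element coordinates taken from the enhancer region in Figure 2c;
    # the CpG position used here is simply the element start.
    print(percent_distance_from_midpoint(226849619, 226849619, 226851122))
```

Collapsing these percentages over all elements gives the values that a density plot would summarize.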

3.7.5 Transcription factor binding site motif analysis

Transcription factor binding site (TFBS) motif analysis was performed using the Homer software151 for lipid-linked UMRs (N=16 regions) replicated in the MuTHER cohort; we excluded replicated lipid-linked LMRs due to their small number (N=5 regions). Default settings were selected with the “given” size option. UMRs harboring replicated lipid-associated CpGs were contrasted against the remaining promoter regions containing interrogated CpGs that lacked nominal significance in the discovery EWAS for any of the four lipid traits (N=912 UMRs). A Bonferroni q<0.05 cutoff was applied for significance.

3.7.6 Differential expression analyses

Peripheral blood mononuclear cells were purified from buffy coats originating from 450 ml blood of healthy blood donors (Uppsala Blood Transfusion Center, Uppsala University Hospital, Sweden), using Ficoll-Paque (GE Healthcare) density-gradient centrifugation. B cells, T cells and monocytes were isolated from dedicated batches of peripheral blood mononuclear cells, using positive selection with CD19+, CD3+ and CD14+ beads (Miltenyi Biotec), respectively, according to the manufacturer's instructions.

RNA isolations were performed using miRNeasy Mini Kit (Qiagen). RNA library preparations were carried out on 500 ng of RNA with RNA integrity number (RIN)>7 isolated from adipocyte cells extracted from AT113,114 and blood cells (CD19+, CD3+ and CD14+) using the Illumina TruSeq Stranded Total RNA Sample preparation kit, according to manufacturer's protocol. Final libraries were analysed on a Bioanalyzer and sequenced on the Illumina HiSeq2000 (pair-ended 100 bp sequences). Raw reads were trimmed for quality (phred33≥30) and length (n≥32), and Illumina adapters were clipped off using Trimmomatic v. 0.32111. Filtered reads were aligned to the hg19 human reference using STAR v.2.5.1b152. Raw read counts of UCSC genes were obtained using htseq-count v.0.6.1 (http://www-huber.embl.de/users/anders/HTSeq).

Differential expression analysis was done using DESeq2 v.1.18.1 153 on RNA-seq data from adipocytes isolated from adipose tissue (subcutaneous and visceral) of 20 obese individuals undergoing bariatric surgery (IUCPQ) and different blood cell types (N=11 B cells; N=20 T cells; N=20 monocytes) of healthy European individuals (Uppsala Blood Transfusion Center, Uppsala University Hospital, Sweden). We used stringent cutoffs to define adipocyte-specific expression, requiring log2-fold-change>2 and p<0.05 across all six comparisons of adipocytes to blood cell types.
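The adipocyte-specificity rule above (log2-fold-change>2 and p<0.05 in every one of the six adipocyte versus blood cell comparisons) can be sketched in Python as follows. The input layout, function name and gene labels are hypothetical; in practice the per-comparison statistics would come from the DESeq2 results tables.

```python
from typing import Dict, List, Sequence, Tuple

# One (log2 fold change, p-value) pair per adipocyte-vs-blood comparison.
Comparison = Tuple[float, float]


def adipocyte_specific_genes(results: Dict[str, Sequence[Comparison]],
                             min_log2fc: float = 2.0,
                             max_p: float = 0.05,
                             n_comparisons: int = 6) -> List[str]:
    """Return genes passing both cutoffs in all required comparisons."""
    specific = []
    for gene, comparisons in results.items():
        if len(comparisons) != n_comparisons:
            continue  # incomplete evidence: do not call the gene specific
        if all(lfc > min_log2fc and p < max_p for lfc, p in comparisons):
            specific.append(gene)
    return sorted(specific)


if __name__ == "__main__":
    # Hypothetical genes and statistics, for illustration only.
    example = {
        "GENE_A": [(3.1, 0.001)] * 6,
        "GENE_B": [(3.1, 0.001)] * 5 + [(1.5, 0.001)],
    }
    print(adipocyte_specific_genes(example))
```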

3.7.7 Linking gene expression to methylation in MuTHER cohort

We expanded on a previously published methylation-expression association analysis performed within the MuTHER cohort66 to assess possible long-range interactions (+/- 1Mb) for CpGs mapping to both LMRs and UMRs. We restricted to 145,913 450K CpGs residing in 27,258 adipose regulatory regions and tested for association to 20,326 expression probes (IlluminaHT12) for 602 individuals with matched samples. We used a similar linear mixed-effects model as described previously66, implemented with the lme4 package154 lmer() function fitted by maximum likelihood.
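The selection of CpG and expression probe pairs within the +/- 1Mb window can be sketched in Python as follows. This is an illustration under assumptions: the function name and identifiers are hypothetical, and the distance is measured between a single CpG coordinate and a single probe coordinate, whereas the text does not specify which probe or gene coordinate was used.

```python
from bisect import bisect_left, bisect_right
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple

Feature = Tuple[str, str, int]  # (identifier, chromosome, position)


def pair_cpgs_with_probes(cpgs: Sequence[Feature],
                          probes: Sequence[Feature],
                          window: int = 1_000_000) -> List[Tuple[str, str]]:
    """List (CpG, probe) pairs on the same chromosome within the window."""
    # Index probe positions per chromosome, sorted for binary search.
    by_chrom: Dict[str, List[Tuple[int, str]]] = defaultdict(list)
    for probe_id, chrom, pos in probes:
        by_chrom[chrom].append((pos, probe_id))
    index = {}
    for chrom, entries in by_chrom.items():
        entries.sort()
        index[chrom] = ([p for p, _ in entries], [i for _, i in entries])

    pairs = []
    for cpg_id, chrom, pos in cpgs:
        if chrom not in index:
            continue
        positions, ids = index[chrom]
        lo = bisect_left(positions, pos - window)
        hi = bisect_right(positions, pos + window)
        pairs.extend((cpg_id, probe_id) for probe_id in ids[lo:hi])
    return pairs
```

Each returned pair would then correspond to one methylation-expression model to fit.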

As before, the linear mixed-effects model was adjusted for both fixed effects (age, beadchip, BS conversion efficiency, BS-treated DNA input) and random effects (family relationship and zygosity) but here we added BMI as an additional covariate. We used a likelihood ratio test to assess the significance of the gene expression effect. The p-value of the gene expression effect in each model was calculated from the Chi-square distribution with 1 degree of freedom (df) and −2log(likelihood ratio) as the test statistic. In total, we tested 4,245,804 methylation to gene expression associations and assessed the false-discovery rate (FDR 10%) using the R/Bioconductor q-value package149.
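The likelihood ratio test p-value described above can be sketched in Python as follows. The function name and the log-likelihood inputs are hypothetical; the sketch assumes the maximized log-likelihoods of the model without and with the gene expression term are already available (for example from the fitted mixed models), and it uses the identity that the upper tail of a Chi-square distribution with 1 degree of freedom at x equals erfc(sqrt(x/2)).

```python
import math
from typing import Tuple


def lrt_p_value(loglik_null: float, loglik_full: float) -> Tuple[float, float]:
    """Likelihood ratio test with 1 degree of freedom.

    Returns the test statistic -2*log(likelihood ratio) and its p-value.
    """
    # -2 * log(L_null / L_full) = 2 * (loglik_full - loglik_null)
    statistic = 2.0 * (loglik_full - loglik_null)
    # Guard against tiny negative values caused by optimizer noise.
    statistic = max(statistic, 0.0)
    # Chi-square (df = 1) survival function: P(X > x) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(statistic / 2.0))
    return statistic, p_value


if __name__ == "__main__":
    # Hypothetical log-likelihoods, for illustration only.
    stat, p = lrt_p_value(loglik_null=-100.0, loglik_full=-97.0)
    print(stat, p)
```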

3.7.8 Association of gene expression to lipids in MuTHER cohort

Associations between gene expression levels (IlluminaHT12) and lipid status within the MuTHER cohort were modeled using a linear mixed effects model as described previously155. Briefly, the model was fitted by maximum likelihood using the lmer function in the lme4 package154. The linear mixed effects model was adjusted for age and experimental batch (fixed effects) and family relationship (twin-pairing) and zygosity (random effects). A likelihood ratio test was used to assess the significance of the phenotype effect. The p-value of the phenotype effect in each model was calculated from the Chi-square distribution with 1 degree of freedom using -2log(likelihood ratio) as the test statistic.

3.7.9 Gene enrichment pathway analyses

Core expression analyses were performed using default settings in the Ingenuity Pathway Analysis software. Only the top 5 canonical pathways are reported. We ran the software on (1) 30 genes showing associations to both replicated adipose lipid-CpGs mapping with adipose regulatory regions and to the same lipid trait independently (see “Functional annotation of lipid-CpGs”), and (2) 52 genes directly overlapping the 68 lipid-linked adipose regulatory regions validated in whole blood (see “Tissue-specificity of lipid-linked regulatory regions”).

3.7.10 Conditional modelling of HDL-EWAS on SNPs

Genotypes for the 21 SNPs in the region of interest within GALNT2 (Supplementary Data 8) were generated for 148 adipose IUCPQ samples. For 56/113 discovery samples profiled via MCC-Seq using MetV1, genotypes were obtained by high-density genotyping on the Illumina HumanOmni2.5-8 (Omni2.5) BeadChip according to protocols recommended by Illumina. For 92 discovery samples profiled via MCC-Seq using MetV2, genotypes were inferred using the Bis-SNP software109, a bisulfite-sequencing variant caller, with default parameters: ‘-T BisulfiteGenotyper -stand_call_conf 20 -stand_emit_conf 0 -mmq 30 -mbq 17 -minConv 0’ and with dbSNP 137 as prior SNP information. The aligned BAM files were used as input and hg19 was used as the reference genome.

Conditional modelling of HDL-association at chr1:230313001 for the 21 SNPs in the region of interest within GALNT2 (Supplementary Data 8) was carried out independently for each SNP by adding the genotype status as a covariate in the GLMs as described in “Epigenome-wide association of plasma lipid levels” above.

3.7.11 Code availability

No custom code or software was used in this study.

3.8 Data Availability

The methylation and expression data from the MuTHER cohort have been deposited in ArrayExpress, https://www.ebi.ac.uk/arrayexpress/ (accession no. “E-MTAB-1866” and “E-TABM-1140”). Lipid-EWAS results from the adipose and whole blood IUCPQ cohorts as well as the whole blood CARTaGENE cohort can be visualized in the UCSC Genome Browser by adding the following URL to “My Hubs”: https://emc.genome.mcgill.ca/myHub/hub_adipose.txt. Raw MCC-Seq reads from the IUCPQ cohorts are deposited to the European Genome-phenome Archive (EGA) and available (accession no. EGAS00001003415) after approval by the Data Access Committee (DAC) designated to the study (https://www.ebi.ac.uk/ega/home). All other relevant data supporting the key findings of this study are available within the article and its Supplementary Information files or from the corresponding author upon reasonable request. A reporting summary for this Article is available as a Supplementary Information file.

3.9 Acknowledgements

This work was supported by a Canadian Institutes of Health Research (CIHR) team grant awarded to E.G. and M.L. (TEC-128093), a CIHR Foundation grant awarded to E.G. (148391) and the CIHR funded Epigenome Mapping Centre at McGill University (EP1-120608) awarded to T.P. and M.L. E.G. holds the Roberta D. Harding & William F. Bradley, Jr. Endowed Chair in Genomic Research and T.P. holds the Dee Lyons/Missouri Endowed Chair in Pediatric Genomic Medicine. A.T. is the director of a Research Chair in Bariatric and Metabolic Surgery. M.C.V. holds the Canada Research Chair in Genomics Applied to Nutrition and Health (Tier 1). F.A. held a studentship from The Fonds de recherche du Québec (FRSQ) during part of this study.

The study was further supported by the Swedish Rheumatism Association and King Gustaf V's 80-years Foundation together with The Swedish Research Council and Wallenberg Foundation awarded to L.R. This study was also supported by the NIHR Oxford Biomedical Research Centre. The views expressed are those of the authors and not necessarily those of the NHS, the NIHR or the Department of Health. The MuTHER Study was funded by a program grant from the Wellcome Trust (081917/Z/07/Z) and core funding for the Wellcome Trust Centre for Human Genetics (090532). The TwinsUK study was funded by the Wellcome Trust and European Community’s Seventh Framework Programme (FP7/2007-2013). The TwinsUK study also receives support from the National Institute for Health Research (NIHR)-funded BioResource, Clinical Research Facility and Biomedical Research Centre based at Guy's and St Thomas' NHS Foundation Trust in partnership with King's College London.

We thank the NIH Roadmap Epigenomics Consortium and the Mapping Centers (http://nihroadmap.nih.gov/epigenomics/) for the production of publicly available reference epigenomes. Specifically, we thank the mapping centres at MGH/BROAD and UCSF for generation of human adipose (donor 92 and 7) and peripheral blood (TC014 and TC015) reference epigenomes used in this study, respectively.

We further thank additional members of the MuTHER consortium for providing valuable data for this study. Please see the Supplementary Note in the Supplementary Information document for a full list of additional MuTHER members not already included in the author list.

3.10 Author Information

Competing interests

The authors declare no competing interest. A.T. receives research funding from Johnson & Johnson Medical Companies and Medtronic for studies unrelated to this manuscript.

3.11 Main Tables and Figures

3.11.1 Tables

Table 1. Genetic regulation on lipid-linked adipose regulatory regions

Lipid-linked regulatory regions                                    Genetic regulation enrichment (fold-change)
All lipid-linked elements (N=567)                                  1.5
Adipose-specific elements (N=226)                                  1.1
Tissue-shared elements (N=341)                                     1.7
Tissue-shared elements validated in blood cohort 1 (N=68)          2.1
Tissue-shared elements validated in blood cohort 1 and 2 (N=22)    2.2


3.11.2 Figures

Figure 1. Study flow chart

Overview of included study cohorts and follow-up analyses to characterize identified lipid-linked adipose tissue regulatory regions.


Figure 2. Positional mapping of lipid-CpGs within adipose tissue regulatory elements

[Figure 2 panels. (a) Density of CpGs (0.1q and full sets) against distance from midpoint of putative enhancer regions (%). (b) Density of CpGs (0.1q and full sets) against distance from midpoint of putative promoter regions (%), with the TSS and gene orientation indicated. (c, d) -log10(p-value) of Discovery lipid-EWAS and MuTHER lipid-EWAS signals along (c) chromosome 1 and (d) chromosome 15.]

Specific positional trends of significant lipid-CpGs (FDR 10%) merged across all studied lipid traits (i.e. triglycerides, HDL-C, LDL-C and total cholesterol) were investigated at adipose regulatory regions. Positions of CpGs were tabulated as the percent distance from the midpoint of elements (genomic distance from midpoint

(bp)/length of element(bp)*100) and collapsed to summarize positional trends over all assessed elements. Positional trends are shown for (a) CpGs mapping to LMRs

(N=225,771) and (b) CpGs mapping to UMRs within +/-1.5kb of transcription start


sites (TSS) not depicting bivalent gene transcription orientations (taking gene orientation into account; N=418,246). The fine-mapping potential of MCC-Seq over array-based methods is exemplified in (c) a replicated HDL-linked enhancer region

(chr1:226849619-226851122) and (d) a replicated LDL-linked promoter region

(chr15:74464626-74466792), where we noted top discovery lipid-CpGs to mimic trends noted in (a) and (b) at adjacent sites to signals identified from the large-scale

450K-based MuTHER study.
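The positional metric used in this legend (genomic distance from midpoint (bp) / length of element (bp) * 100) can be sketched as follows. This is a minimal illustration only: the function name, the `strand` argument and the sign flip used to take gene orientation into account are assumptions made for the example, and the CpG position in the usage line is hypothetical.

```python
def percent_distance_from_midpoint(cpg_pos, start, end, strand="+"):
    """Signed distance of a CpG from the element midpoint, as % of element length.

    Values fall in [-50, 50] for CpGs located inside the element. For
    minus-strand genes the sign is flipped so that negative values always
    point upstream (an assumed way of accounting for gene orientation).
    """
    length = end - start
    if length <= 0:
        raise ValueError("element must have positive length")
    midpoint = start + length / 2
    pct = (cpg_pos - midpoint) / length * 100
    return -pct if strand == "-" else pct


# Element coordinates are those of the replicated HDL-linked enhancer in (c);
# the CpG position is hypothetical.
print(percent_distance_from_midpoint(226850000, 226849619, 226851122))
```

Collapsing these percentages over all elements gives one common axis from -50% to +50% on which densities such as those in (a) and (b) can be drawn, regardless of element length.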


Figure 3. TG-linked adipose-specific regulatory region shows putative pleiotropic effects

[Figure 3 panels (genome browser views, hg19). (a) chr19:1,500,000-3,000,000 (scale 500 kb): genes, adipose regulatory regions and whole-blood regulatory regions. (b) chr19:2,331,500-2,334,500 at SPPL2B (scale 1 kb): -log10(p-value) tracks for the Discovery adipose, IUCPQ whole-blood and CARTaGENE whole-blood TG-EWAS; Illumina 450K probes cg05660874 and cg10723746; adipose (LMR) and whole-blood (UMR) regulatory regions; adipocyte and peripheral blood H3K4me1 and H3K4me3. (c) chr19:2,550,000-2,700,000 at GNG7 (scale 100 kb): Illumina 450K probes cg19853565, cg10884953, cg19653589, cg15468423 and cg07843390; Illumina HT-12 probe ILMN_1728107; adipose regulatory regions (LMRs and UMRs); adipocyte H3K4me1 and H3K4me3. (d) chr19:3,135,000-3,165,000 at GNA15 (scale 10 kb): Illumina 450K probes cg26870745 and cg03302088; Illumina HT-12 probe ILMN_1773963; adipose regulatory regions (LMRs and UMR); adipocyte H3K4me1 and H3K4me3.]


A top discovery TG-CpG (chr19:2332436; corrected p=2.4x10^-5; sky blue track) replicated by multiple nearby MuTHER TG-CpGs (cg05660874; p=5.1x10^-10; cg10723746; p=1.0x10^-8; light green track) locates within an adipose-specific enhancer region (chr19:2332094-2333076) overlapping the first intron of SPPL2B (LMR; shown in red in (a) the broad and (b) zoomed-in view). Methylation levels at cg05660874 and cg10723746 show associations to cis-locating REEP6, MKNK2, GNG7 and GNA15 (highlighted in red in (a)), which in turn exhibit associations to TG levels in the MuTHER cohort with GNG7 and GNA15 showing the strongest links ((c) GNG7; cg05660874 vs. ILMN_1709247; p=4.9x10^-5; cg10723746 vs. ILMN_1709247; p=3.1x10^-8; ILMN_1709247 vs. TG; p=1.2x10^-12; (d) GNA15; ILMN_1773963 vs. cg10723746; p=1.5x10^-17; ILMN_1773963 vs. cg05660874; p=1.5x10^-16; ILMN_1773963 vs. TG; p=1.5x10^-18). We show evidence for a co-regulation network between these two genes and the enhancer region by highlighting associations between 450K array probes (light green tracks in (b), (c) and (d)) locating to several regulatory regions (shown in red in (b) and teal in (c) and (d)) and expression levels of (c) GNG7 and (d) GNA15 in MuTHER. We show a lack of whole blood lipid-EWAS signals at the enhancer of interest (b), which is supported by the adipocyte-specific nature of chromatin signatures observed at the locus (Roadmap Epigenomics Consortium; adipocyte nuclei donor 92 shown in orange vs. peripheral blood donor TC015 shown in green).


Figure 4. HDL-C linked adipose-specific regulatory region under genetic regulation

[Figure 4 panels (genome browser views, hg19). (a) chr1:230,200,000-230,400,000 at GALNT2 (scale 100 kb): NHGRI-EBI Catalog of GWAS SNP rs4846914; top independent HDL-GWAS SNP rs627702; -log10(p-value) tracks for the Discovery adipose, IUCPQ whole-blood and CARTaGENE whole-blood HDL-EWAS; adipose and whole-blood regulatory regions (UMRs and LMRs); adipose WGBS profile; adipocyte RNA-Seq (forward) and peripheral blood RNA-Seq (merged). (b) chr1:230,310,000-230,325,000 at GALNT2 (scale 5 kb): top independent HDL-GWAS SNP rs627702; top metQTL SNPs (rs611841, rs637180, rs671682, rs586712, rs636810, rs677990, rs598203, rs628035, rs2760537, rs666718, rs607553, rs653523, rs678578, rs600845, rs611701, rs678050, rs650808, rs612577, rs2748115, rs57992758, rs585361); the same three HDL-EWAS tracks; Illumina 450K probes; adipose regulatory region (LMR); adipocyte and peripheral blood H3K4me1, H3K27ac and H3K4me3.]


A discovery HDL-CpG (chr1:230313001; corrected p=2.0x10^-5; sky blue track) maps within an intragenic region of GALNT2 (chr1:230312462-230313455) overlapping an adipose-specific putative enhancer region (LMR; shown in red in (a) the broad and (b) zoomed-in view). The adipose-specific nature of the epigenetic signature at this locus is supported by patterns in adipocyte nuclei (Roadmap Epigenomics Consortium; donor 92 for H3K4me1 and H3K4me3; donor 7 for H3K27ac; orange tracks) versus peripheral blood (Roadmap Epigenomics Consortium; donor TC015; green tracks) chromatin marks as well as from intersecting whole blood EWAS signals (pink and dark orange tracks). We show that the enhancer region is under extensive genetic regulation by nearby cis-SNPs (grey blue tracks in (b)) that are in high LD (r^2>0.9) with an HDL-linked GWAS SNP (Global Lipids Consortium; rs627702; p=5.0x10^-24; purple tracks), which is independent of the previously reported top HDL-linked SNP at this locus (rs4846914; p=4.0x10^-41; dark green track in (a)). We depict a lack of coverage of the 450K array at this region. Adipocyte-specific (in-house data; light orange track) and peripheral blood RNA-Seq (Roadmap Epigenomics Consortium; donor TC014; light green track) data at the locus are also depicted in (a).


3.12 Supplementary Materials

3.12.1 Supplementary Tables

Supplementary Table 1. Characteristics of the study cohorts

Trait                              Discovery     MuTHER        IUCPQ         CARTaGENE
                                   Adipose       Adipose       Whole blood   Whole blood
Study population                   Obese         Normal        Obese         Normal
N (% female)                       199 (60%)     648 (100%)    206 (55%)     137 (35%)
Age (years) [SD]                   37.2 [8.8]    58.9 [9.4]    38.9 [9.8]    55.2 [7.8]
BMI (kg/m^2) [SD]                  53.7 [8.9]    26.7 [4.8]    51.5 [8.8]    26.8 [4.4]
Triglycerides (mmol/L) [SD]        1.5 [0.7]     1.1 [0.6]     1.5 [0.6]     1.6 [0.9]
HDL-C (mmol/L) [SD]                1.3 [0.3]     1.8 [0.5]     1.3 [0.3]     1.3 [0.4]
LDL-C (mmol/L) [SD]                2.8 [0.7]     3.3 [1.0]     2.8 [0.8]     3.0 [0.9]
Total cholesterol (mmol/L) [SD]    4.8 [0.8]     5.6 [1.1]     4.8 [0.9]     5.0 [1.0]


Supplementary Table 2. Size and CpG density characterization of adipose regulatory regions

Adipose regulatory   Length                              Discovery CpG density (CpGs/region)
region type          mean (bp)   min (bp)   max (bp)     mean   min   max
LMR                  804         100        8688         7      1     29
UMR                  2199        233        6992         37     1     355


Supplementary Table 3. Overlap between discovery CpGs versus those on the EPIC and 450K array at adipose regulatory regions

Adipose regulatory   Discovery   EPIC array CpGs                                        MuTHER array CpGs
region type          CpGs        Total CpGs*     Directly overlapping discovery CpGs*   Total CpGs*     Directly overlapping discovery CpGs*
LMR                  225,771     38,416 (17%)    23,317 (10%)                           13,256 (6%)     9,407 (4%)
UMR                  696,492     201,113 (29%)   83,159 (12%)                           133,755 (19%)   56,422 (8%)

* Percentages are calculated using the total number of discovery CpGs per category as the denominator
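The footnote's denominator rule can be checked with a few lines of arithmetic. The sketch below recomputes each percentage from the counts in the table; the dictionary layout and the function name are illustrative choices, and rounding to the nearest integer is assumed.

```python
# Counts copied from Supplementary Table 3; the layout is illustrative.
DISCOVERY_CPGS = {"LMR": 225_771, "UMR": 696_492}
ARRAY_CPGS = {
    "LMR": {"EPIC total": 38_416, "EPIC overlapping": 23_317,
            "MuTHER total": 13_256, "MuTHER overlapping": 9_407},
    "UMR": {"EPIC total": 201_113, "EPIC overlapping": 83_159,
            "MuTHER total": 133_755, "MuTHER overlapping": 56_422},
}


def percent_of_discovery(region, column):
    """Array CpG count as a percentage of discovery CpGs in the same region type."""
    return round(100 * ARRAY_CPGS[region][column] / DISCOVERY_CPGS[region])


for region, columns in ARRAY_CPGS.items():
    for column in columns:
        print(region, column, f"{percent_of_discovery(region, column)}%")
```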


Supplementary Table 4. Transcription factor binding site motifs at regions flanking replicated MuTHER lipid-CpGs mapping to UMRs

Rank  Motif Name                                            Consensus         Entrez  Candidate gene  p-value   Log p-value  q-value (Benjamini)  Target sequences with motif (%)  Background sequences with motif (%)
1     STAT5(Stat)/mCD4+-Stat5-ChIP-Seq(GSE12346)/Homer      RTTTCTNAGAAA      6776    STAT5A          1.00E-04  -1.11E+01    3.90E-03             0.6                              0.1
2     STAT1(Stat)/HelaS3-STAT1-ChIP-Seq(GSE12782)/Homer     NATTTCCNGGAAAT    6772    STAT1           1.00E-04  -1.02E+01    4.90E-03             0.6                              0.1
3     Stat3+il21(Stat)/CD4-Stat3-ChIP-Seq(GSE19198)/Homer   SVYTTCCNGGAARB    6774    STAT3           1.00E-03  -8.14E+00    2.58E-02             0.8                              0.4
4     NF1(CTF)/LNCAP-NF1-ChIP-Seq(Unpublished)/Homer        CYTGGCABNSTGCCAR  4781    NFIB            1.00E-03  -7.40E+00    4.03E-02             0.8                              0.3
5     RUNX(Runt)/HPC7-Runx1-ChIP-Seq(GSE22178)/Homer        SAAACCACAG        861     RUNX1           1.00E-03  -6.99E+00    4.89E-02             0.7                              0.3


Supplementary Table 5. Top canonical pathways for genes modulated by replicated lipid-linked regulatory regions and further linked to the same circulating lipid traits

Ingenuity canonical pathways   P-value    Molecules                Total number of molecules
Gαq Signaling                  6.94E-05   GNA15, CSK, KLB, GNG7    4
Ephrin B Signaling             1.40E-04   GNA15, RAC3, GNG7        3
SAPK/JNK Signaling             4.96E-04   KLB, RAC3, GNG7          3
Relaxin Signaling              1.34E-03   GNA15, KLB, GNG7         3
Tec Kinase Signaling           1.65E-03   GNA15, KLB, GNG7         3


Supplementary Table 6. Top canonical pathways for genes overlapping lipid-linked regulatory regions replicating in whole blood

Ingenuity canonical pathways                          P-value    Molecules                 Total number of molecules
Adipogenesis pathway                                  3.08E-03   HDAC4, BMP4, AKT1         3
S-methyl-5-thio-α-D-ribose 1-phosphate Degradation    6.50E-03   MRI1                      1
Oxidized GTP and dGTP Detoxification                  6.50E-03   NUDT1                     1
Axonal Guidance Signaling                             1.71E-02   BMP4, AKT1, RHOD, GDF7    4
Ceramide Signaling                                    1.96E-02   AKT1, CERK                2


3.12.2 Supplementary Figures

Supplementary Figure 1. QQplots for EWAS of TG to methylation associations before and after correction

[QQ plots, panels A (before correction) and B (after correction): observed -log10P (y-axis) against expected -log10P (x-axis).]

Associations between triglycerides (TG) levels and methylation in visceral adipose tissue were assessed at 1,299,825 CpGs. The Bayesian method BACON was applied to control for bias and inflation of our test-statistics. QQplots of p-values (a) before

(lambda=1.6148) and (b) after (lambda=1.0387) statistical correction are shown. FDR

10% (blue dotted line) and FDR 5% (orange dotted line) cutoffs are depicted.
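The lambda values quoted in these QQ-plot legends are genomic inflation factors. A minimal sketch of the conventional definition (median observed 1-df chi-square statistic divided by its expected median under the null) is given below, using only the Python standard library. It does not reproduce the BACON correction itself, the function name is illustrative, and whether the reported lambdas were computed in exactly this way is an assumption.

```python
from statistics import NormalDist, median


def genomic_inflation(p_values):
    """Lambda = median observed chi-square (1 df) / expected median under the null."""
    norm = NormalDist()
    # A two-sided p-value corresponds to a 1-df chi-square of z**2, z = Phi^-1(p / 2).
    chi2 = [norm.inv_cdf(p / 2) ** 2 for p in p_values]
    expected_median = norm.inv_cdf(0.25) ** 2  # about 0.455
    return median(chi2) / expected_median


# Hypothetical p-values, for illustration only (each must lie in (0, 1]).
example_p = [0.001, 0.02, 0.2, 0.4, 0.6, 0.9]
print(genomic_inflation(example_p))
```

A lambda near 1 indicates test statistics that follow the null distribution for most CpGs, while values well above 1 indicate inflation, which is what the before and after panels contrast.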


Supplementary Figure 2. QQplots for EWAS of HDL to methylation associations before and after correction

[QQ plots, panels A (before correction) and B (after correction): observed -log10P (y-axis) against expected -log10P (x-axis).]

Associations between HDL-C (HDL) levels and methylation in visceral adipose tissue were assessed at 1,299,825 CpGs. The Bayesian method BACON was applied to control for bias and inflation of our test-statistics. QQplots of p-values (a) before

(lambda=1.5232) and (b) after (lambda=1.0718) statistical correction are shown. FDR

10% (blue dotted line) and FDR 5% (orange dotted line) cutoffs are depicted.


Supplementary Figure 3. QQplots for EWAS of LDL to methylation associations before and after correction

[QQ plots, panels A (before correction) and B (after correction): observed -log10P (y-axis) against expected -log10P (x-axis).]

Associations between LDL-C (LDL) levels and methylation in visceral adipose tissue were assessed at 1,299,825 CpGs. The Bayesian method BACON was applied to control for bias and inflation of our test-statistics. QQplots of p-values (a) before

(lambda=1.4465) and (b) after (lambda=1.1088) statistical correction are shown. FDR

10% (blue dotted line) and FDR 5% (orange dotted line) cutoffs are depicted.


Supplementary Figure 4. QQplots for EWAS of TC to methylation associations before and after correction

[QQ plots, panels A (before correction) and B (after correction): observed -log10P (y-axis) against expected -log10P (x-axis).]

Associations between total cholesterol (TC) levels and methylation in visceral adipose tissue were assessed at 1,299,825 CpGs. The Bayesian method BACON was applied to control for bias and inflation of our test-statistics. QQplots of p-values (a) before

(lambda=1.5779) and (b) after (lambda=1.0499) statistical correction are shown. FDR

10% (blue dotted line) and FDR 5% (orange dotted line) cutoffs are depicted.


Supplementary Figure 5. Significant associations between methylation and lipid phenotypes in the discovery cohort

[Venn diagrams of the overlaps among TG-, HDL-C-, LDL-C- and TC-linked CpG sets at (a) FDR 10% and (b) FDR 5%.]

Associations between lipid traits (i.e. triglycerides (TG), HDL-C, LDL-C and total cholesterol (TC) levels) and CpG methylation were assessed at 1,299,825 CpGs.

Overlaps between the different lipid-CpG sets are depicted in Venn diagrams for lipid-CpGs significant at (a) FDR 10% (N=1,230 lipid-CpGs) and (b) FDR 5% (N=615 lipid-CpGs).


Supplementary Figure 6. Methylation range variance across CpGs within the discovery cohort

[Boxplots of methylation range per CpG (%; y-axis) for Discovery CpGs and Discovery lipid-CpGs.]

The range of methylation captured at each CpG across individuals was assessed in the discovery adipose tissue cohort. Boxplot representations of the methylation range per CpG (y-axis) are depicted for (red) all CpGs captured in the discovery cohort via the MCC-Seq method (N=1,299,825 CpGs) and (teal) lipid-CpGs significant at FDR 10% in the discovery cohort (N=1,230 CpGs).
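The per-CpG methylation range summarized in these boxplots can be sketched as the difference between the highest and lowest methylation level observed across individuals. The function name, the handling of missing values and the toy data below are assumptions made for illustration.

```python
def methylation_range(levels):
    """Range (max - min) of methylation (%) at one CpG across individuals.

    Missing calls are passed as None and ignored; at least two observed
    values are required, otherwise None is returned.
    """
    observed = [v for v in levels if v is not None]
    if len(observed) < 2:
        return None
    return max(observed) - min(observed)


# Hypothetical methylation levels (%) for two CpGs across four individuals.
toy_cpgs = {
    "cpg_A": [10.0, 55.0, None, 30.0],
    "cpg_B": [92.0, 95.0, 90.0, 94.0],
}
for name, levels in toy_cpgs.items():
    print(name, methylation_range(levels))
```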


Supplementary Figure 7. Annotation of lipid-CpGs among adipose tissue regulatory elements

[Bar plot of log2(fold-change) (y-axis) for lipid-CpGs at q<0.1 and for the top 25% variable lipid-CpGs at q<0.1 across adipose tissue enhancers, adipose-specific enhancers, adipose tissue promoters and adipose-specific promoters.]

Discovery cohort CpGs showing association to lipid traits were mapped and annotated to adipose tissue regulatory regions. Trends observed for all lipid-CpGs (blue) and those within the top 25th percentile of methylation variability (orange) are contrasted. Significant enrichment (y-axis) of lipid-CpGs within adipose tissue putative enhancer regions (low-methylated regions; LMRs) was observed for both sets of lipid-CpGs (blue p=6.6x10^-13; orange p=2.7x10^-16), which was strengthened when limiting to adipose-unique LMRs (blue p=9.9x10^-13; orange p<2.2x10^-16). Significant depletion (y-axis) of lipid-CpGs was noted within adipose tissue putative promoter regions (unmethylated regions; UMRs; blue p<2.2x10^-16; orange p<2.2x10^-16). In contrast, enrichment was again found when restricting to adipose-unique UMRs (blue p=8.1x10^-11; orange p=1.1x10^-7). Association between lipid traits and CpG methylation was tested at 1,299,825 CpGs. Fold-change significance was calculated using Fisher's exact test.
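An enrichment test of this kind can be sketched with a 2x2 table of lipid-CpGs versus all other CpGs, inside versus outside a class of regulatory regions. The implementation below uses only the standard library and is illustrative: the function names, the choice of background (all tested CpGs) for the fold-change and the two-sided definition are assumptions. The example counts are taken from the legends (1,230 lipid-CpGs, of which 314 map to LMRs; 225,771 of 1,299,825 tested CpGs map to LMRs); whether these are exactly the counts behind the plotted bars is also an assumption.

```python
from math import exp, lgamma, log2


def _log_comb(n, k):
    """Natural log of the binomial coefficient C(n, k)."""
    return lgamma(n + 1) - lgamma(k + 1) - lgamma(n - k + 1)


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact p-value for the 2x2 table [[a, b], [c, d]]."""
    row1, col1, total = a + b, a + c, a + b + c + d
    log_denom = _log_comb(total, col1)

    def log_prob(k):
        # Hypergeometric probability of observing k in the top-left cell.
        return _log_comb(row1, k) + _log_comb(total - row1, col1 - k) - log_denom

    k_min, k_max = max(0, col1 - (total - row1)), min(row1, col1)
    log_obs = log_prob(a)
    # Sum over all tables that are no more probable than the observed one.
    tail = sum(exp(lp) for lp in map(log_prob, range(k_min, k_max + 1))
               if lp <= log_obs + 1e-7)
    return min(1.0, tail)


def log2_fold_change(a, b, c, d):
    """log2 of (fraction of lipid-CpGs in region) / (fraction of all CpGs in region)."""
    return log2((a / (a + b)) / ((a + c) / (a + b + c + d)))


# a: lipid-CpGs in LMRs, b: lipid-CpGs elsewhere,
# c: other CpGs in LMRs, d: other CpGs elsewhere.
a, b = 314, 1_230 - 314
c = 225_771 - a
d = 1_299_825 - a - b - c
print(log2_fold_change(a, b, c, d), fisher_exact_two_sided(a, b, c, d))
```

Working in log space with `lgamma` keeps the hypergeometric terms numerically stable for tables with more than a million CpGs.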


Supplementary Figure 8. Mean CpG coverage across adipose tissue regulatory elements

[Bar plots of mean CpG coverage (y-axis) for the 0.1q and full CpG sets in four bins of distance from the midpoint (-50 to -25, >-25 to 0, >0 to 25 and >25 to 50; bins in %) of (a) putative enhancer (LMR) regions and (b) putative promoter (UMR) regions.]

To account for possible biases attributable to coverage, the mean CpG coverage within 25% bins of the total distance across the elements was tabulated and plotted for (a) all CpGs (N=225,771) and lipid-CpGs at FDR 10% (N=314) mapping to LMRs and (b) all CpGs (N=418,246) and lipid-CpGs at FDR 10% (N=225) mapping to UMRs within +/-1.5kb of transcription start sites (TSS) not depicting bivalent gene transcription orientations.


Supplementary Figure 9. Positional mapping of CpGs overlaying Illumina 450K and EPIC array probes

MCC-Seq CpGs overlapping Illumina 450K and EPIC array CpGs and mapping to adipose tissue regulatory promoter regions (UMRs) were further investigated for specific positional trends. Positions of CpGs were tabulated as the percent distance from the midpoint of elements (genomic distance from midpoint (bp)/length of element(bp)*100) and collapsed to summarize positional trends over all assessed elements. Positional trends within UMRs mapping to +/-1.5kb of transcription start sites (TSS) not depicting bivalent gene transcription orientations (taking gene orientation into account) are shown for (a) all CpGs overlaying 450K array CpGs

(N=93,648) and (b) all CpGs overlaying EPIC array CpGs (N=137,156).


Supplementary Figure 10. Expression profile of STAT5A across the multiple tissues in GTEx

Expression levels are assessed in TPM (transcripts per million; y-axis) for various tissues. Focusing on subcutaneous adipose, visceral adipose and whole blood tissues, values of 85.865, 54.950 and 40.860 TPM are reported for this gene, respectively (GTEx Portal; October 2018).


Supplementary Figure 11. Expression profile of STAT3 across the multiple tissues in GTEx

Expression levels are assessed in TPM (transcripts per million; y-axis) for various tissues. Focusing on subcutaneous adipose, visceral adipose and whole blood tissues, values of 126.240, 140.190 and 105.300 TPM are reported for this gene, respectively (GTEx Portal; October 2018).


Supplementary Figure 12. Expression profile of NFIB across the multiple tissues in GTEx

Expression levels are assessed in TPM (transcripts per million; y-axis) for various tissues. Focusing on subcutaneous and visceral adipose tissues, values of 39.790 and 36.470 TPM are reported for this gene, respectively (GTEx Portal; October 2018).


Supplementary Figure 13. Genomic distance between CpGs and their top associated SNP

Adipose tissue SNP-CpG associations (metQTL; +/-250kb) were overlapped with discovery cohort CpGs (N=1,299,494). We depict the genomic distance between the CpGs and their top associated SNP for metQTLs at FDR 10% (N=110,957; median absolute distance=69,303bp), FDR 5% (N=64,240; median absolute distance=46,183bp) and FDR 1% (N=27,392; median absolute distance=17,230bp), noting an enrichment of SNPs regulating methylation in the vicinity of their linked CpGs.
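The summary statistic quoted for each FDR threshold, the median absolute distance between a CpG and its top associated SNP, can be sketched as follows. The function name and the toy coordinates are hypothetical; the only property taken from the legend is that each SNP lies within the +/-250kb window of its CpG.

```python
from statistics import median


def median_abs_distance(pairs):
    """Median of |SNP position - CpG position| over (cpg_pos, snp_pos) pairs.

    Both positions in a pair are assumed to be on the same chromosome.
    """
    return median(abs(snp - cpg) for cpg, snp in pairs)


# Hypothetical (CpG, top SNP) positions in bp.
toy_pairs = [(1_000, 900), (5_000, 5_400), (20_000, 19_000)]
print(median_abs_distance(toy_pairs))
```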


3.12.3 Supplementary Notes

Supplementary Note 1. List of MuTHER members thanked in the acknowledgements

Kourosh R. Ahmadi, Chrysanthi Ainali, Amy Barrett, Veronique Bataille, Jordana T.

Bell, Alfonso Buil, Emmanouil T. Dermitzakis, Antigone S. Dimas, Richard Durbin,

Daniel Glass, Neelam Hassanali, Catherine Ingle, David Knowles, Maria

Krestyaninova, Cecilia M. Lindgren, Christopher E. Lowe, Eshwar Meduri, Paola di

Meglio, Josine L. Min, Stephen B. Montgomery, Frank O. Nestle, Alexandra C. Nica,

James Nisbet, Stephen O'Rahilly, Leopold Parts, Simon Potter, Johanna Sandling,

Magdalena Sekowska, So-Youn Shin, Kerrin S. Small, Nicole Soranzo, Gabriela

Surdulescu, Mary E. Travers, Loukia Tsaprouni, Sophia Tsoka, Alicja Wilk, Tsun-Po

Yang & Krina T. Zondervan.


3.12.4 Supplementary Data

All Supplementary Data (Data 1-8) can be accessed online from the open access publication Allum et al.156 in Nature Communications using the following link: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC6418220/


Chapter 4: General Discussion

Genetic studies of complex metabolic diseases have so far revealed solid, but minor genetic influences enriched in non-coding regions of the human genome. These findings represent a major challenge in the translation of GWAS results into biological knowledge. The aim of this doctoral project was to link genome with environment through epigenetic variation assessments, thus providing another avenue of exploration to contribute to our understanding of common disease biology.

Through our studies, we demonstrated that integrative analyses of full-resolution epigenome profiles in disease-linked tissues allow for fine-mapping of epigenetic and genetic trait-associated variants. We further complemented ongoing reference epigenome efforts by providing, for the first time, methylation and chromatin accessibility maps in the disease-relevant visceral adipose tissue and purified adipocytes, respectively. In Chapter 2, we presented MCC-Seq as a cost-effective and customizable method for interrogation of functional methylomes. We showed the far-reaching potential of MCC-Seq in large-scale EWAS for full investigation of tissue-specific open chromatin regions and their association to complex traits. In Chapter 3, we expanded the pilot study highlighted in Chapter 2 to present a large-scale EWAS of methylation status and cardiometabolic risk-linked complex traits in disease-associated (i.e. visceral adipose tissue) and bioavailable (i.e. whole blood) tissues. We focused on characterizing the identified trait-linked regions for broader understanding of the common disease-associated methylome landscape. Here, we discuss the main strengths of the presented doctoral work including some avenues for improvement.

Novel next-generation capture sequencing method

Although WGBS represents the gold-standard for genome-wide methylome interrogation, its current high cost makes the application of this technique in large-scale cohorts prohibitive. Recent studies have further demonstrated that only a minor fraction of the methylome (i.e. less than 20%) shows variability across tissues and individuals65. As such, most epigenome-wide studies to date have applied more affordable array-based technologies (e.g. 450K array) to study DNA methylation variation. While providing useful information on various CpGs on a genome-wide scale, the 450K array was designed to mainly target promoter regions, which have been shown to be less variable and, thus, harder to statistically link to disease without using very large sample sizes. The advent of the more recent methylation microarray known as the EPIC array promised to improve on this bias. However, we showed that while progress in terms of diversification of targeted genomic regions has been achieved, a bias to promoter regions as well as insufficient coverage of dynamic CpGs for the majority of tissues remains156.

To counter these limitations, we presented a novel next-generation capture sequencing approach - MCC-Seq - as a high-powered tool for interrogation of functional methylomes over target regions. We collaborated with Roche NimbleGen to implement this targeted enrichment approach and showed that MCC-Seq supports


multiplexing of samples using user-defined capture panels targeting up to 200Mb.

These characteristics enable researchers to drive laboratory-related costs down while maintaining sufficient coverage over target regions. Given our own interest in metabolic diseases, we focused our efforts on developing panels that target active regulatory regions in adipose tissue117. We generated a pilot panel (MetV1) and a more comprehensive panel (MetV2) that target up to ~2.5 x 10^6 and ~4.5 x 10^6 CpGs mapping within the functional methylome of adipose tissue, respectively. We exploited the ability of MCC-Seq to capture both strands over target regions and, thereby, provide reliable and accurate genotype calls, by including SNPs from the

Illumina HumanCore BeadChip array in the MetV2 panel for downstream imputation of genotypes. In this way, we strategically use MCC-Seq as a two-in-one method for simultaneous methylation and genotype profiling. The ability of MCC-Seq to infer genotypes within captured DNA fragments represents an advantage over other targeted profiling techniques such as Agilent SureSelect Human Methyl-Seq

(Agilent SureSelect)117. Importantly, the now commercialized MetV2 panel (Roche

NimbleGen; SeqCap Epi Developer XL Design #131010_HG19_EG_met_EPI) targets

10 times more CpGs than the 450K array and, still, 5 times more than the recently released EPIC array. Given that this panel encompasses all adipose regulatory regions, it represents an unprecedented opportunity to investigate the role of distal regulatory regions in complex diseases.

Through our work, we have demonstrated the utility of MCC-Seq over other available methylome and genotype profiling methods117,156. Specifically, we established


through comparisons of methylation calls, that MCC-Seq is as accurate as WGBS,

450K array and Agilent SureSelect (i.e. Spearman R>0.96)117. As mentioned above,

MCC-Seq allows for the investigation of a larger fraction of the active methylome versus other popular array-based techniques117 with similar price points.

Simultaneous investigation of both distal and proximal regulatory elements confirmed earlier reports65,66 that variable and complex trait-linked CpGs are enriched at putative enhancer regions with this trend being strengthened when focusing on unique tissue-specific regions of disease-relevant tissues117,156. This type of interrogation represents an important step forward that was lacking in prior common disease studies due to a depletion in coverage at distal regulatory regions by other array-based profiling techniques and RRBS. Additionally, deep single-base profiling of tissue-specific regulatory regions (i.e. average of 7 CpGs at enhancers and 37 CpGs at promoters) allowed us to generate novel positional trends of complex trait-associated epigenetic variants156. Comparisons of these full-resolution positional patterns with those captured by array-based approaches exemplified the limitations of the latter methods to assess CpGs within regulatory regions both in terms of low-density coverage and ascertainment biases due to design156. We further demonstrated that the dense CpG coverage by MCC-Seq at regulatory regions enables this technique to be used as an effective fine-mapping tool of epigenetic signals from previous 450K array-based studies156. This represents an interesting avenue for replication and expansion from past studies where, to date, larger cohorts have been profiled than with MCC-Seq.


Importance of using disease-linked tissues in epigenetic investigations

A notable strength of our studies was the use of visceral adipose tissue as our discovery cohort to broaden our understanding of the impact of epigenetic variants in metabolic diseases. Most epigenome-wide studies of complex traits have been conducted in bioavailable tissues such as whole blood and extracted cell fractions76,78,82-84,86,125. However, investigations have shown that epigenetic signals, especially those under environmental-modulation, are mainly cell-specific and, therefore, not always reflected in whole blood samples65,66,128. Similar observations have been noted for genetic variants linked to complex traits where GWAS SNPs were found to be enriched in tissue-specific regulatory regions of biologically-relevant and disease-linked tissues 25-28. As such, we presented the first large-scale investigations of epigenetic variants in visceral adipose tissue and their implication in metabolic diseases117,156. Through our close collaboration with the IUCPQ, we generated and publicly released the first epigenome maps of active methylomes, transcriptomes and chromatin accessibility (ATAC-Seq) for this tissue type. This contribution has been acknowledged by the broader research community who frequently download and use these datasets in independent works. For instance, our adipocyte-specific ATAC-Seq reference maps have been used in comparison analyses with chromatin accessibility tracks of human pancreatic islets157 and skeletal muscle tissue158 in recently published studies. With our EWAS of plasma lipids, we showed the importance of using disease-linked tissues as a starting material for increased discovery power of biologically-relevant epigenetic signals. We reported 518 TG-linked CpGs117 and 567


lipid-linked adipose regulatory regions156 at stringent cut-offs in chapter 2 (N=72

VAT samples) and chapter 3 (N=199 VAT samples), respectively. These numbers represent more than 2.5-fold improvement in initial discovery potential for epigenetic variants compared to even the latest study of lipid traits where over 2,000 whole blood samples were interrogated using the 450K array86. This significant increase in detection power may additionally be attributable to our ability to interrogate tissue-specific distal regulatory regions using MCC-Seq. These findings demonstrate the necessity of interrogating epigenetic variants of active methylomes in biologically-sound tissues and cell types to further our understanding of complex disease etiology.

The novelty of our investigations in visceral adipose tissue also represents a limitation of our studies, owing to the lack of available independent replication cohorts for the same tissue type. In both presented works, we collaborated with the MuTHER consortium66 to gain access to methylation and gene expression levels for subcutaneous adipose tissue (SAT) samples across a large cohort of female twins (N>600 SAT samples) with well-defined phenotypes. We used these datasets as the best available proxy for our replication efforts and showed that a fraction of VAT epigenetic signals replicate in SAT despite biological differences between the two adipose tissue depots. We suggested that additional differences in cohort selection (i.e. mixed-gender disease-based versus female-only population-based) could have contributed to the discordance in replication across the two cohorts. In fact, sex differences are well recognized in metabolic disease risk, including differences in fat distribution where obese men have proportionally more visceral fat than obese women159-161. Although sex was included as a covariate in our EWAS, we were not able to specifically assess sex-specific epigenetic associations due to a lack of power. Finally, given that MCC-Seq is capable of targeting regions not yet explored by other widely used methylation profiling methods, we were unable to carry out replication efforts for a fraction of our novel findings. Thus, future studies of visceral adipose tissue, and of bioavailable tissues, will benefit from the resource that we have generated through our efforts.

Features of complex trait-linked epigenetic variants

By using MCC-Seq, we were able to investigate, for the first time, epigenetic variants spanning user-defined active regulatory regions within a tissue of interest. This application permitted us to expand the repertoire of explored regions beyond that of past studies. As stated above, we focused our efforts on profiling adipose tissue due to its link to obesity and other metabolic diseases. We generated dense single-base profiles over target regions, which allowed us to dissect features of disease trait-associated variants that were not accessible with other widely used profiling techniques.

In both of our EWAS, we confirmed earlier findings66 that metabolic trait-linked epigenetic variants are enriched in enhancers and tissue-specific regions, yet poorly represented in more static promoter regions117,156. The higher resolution afforded by MCC-Seq permitted us to map novel positional features of lipid-linked CpGs at regulatory elements. Specifically, we noted enrichment of lipid-CpGs at the midpoint of enhancer regions, whereas those mapping to putative promoters exhibited a bimodal distribution overlapping the TSS with a shift towards the gene body region of the element156. We noted an enrichment in binding motifs for adipogenesis- and obesity-linked transcription factors, such as STAT family members and NFIB, overlaying lipid-CpG peaks and, therefore, hypothesized that these trends mirrored TF hotspots.

DNA methylation is known to be variable across individuals and tissues, with genetic and environmental effects underlying this variability. By targeting the full active methylome in a common disease-linked tissue, we were able to further our understanding of the contribution of genetic effects to the regulation of complex trait-linked CpGs. Specifically, we integrated cis-SNP-CpG associations tabulated within the MuTHER cohort66 (Chapter 2) and the discovery IUCPQ cohort126 (Chapter 3) and found that a large fraction (>55%) of highlighted lipid-associated regulatory regions are under genetic regulation117,156. Genetic effects were found to be stronger at regions shared across tissues (i.e. adipose to whole blood tissue) and stronger still at shared promoters than at distal regulatory elements, hinting at a genetically driven regulatory mechanism over these regions. Interestingly, in both studies, we noted enrichment in genetic regulation by metabolic disease-linked GWAS SNPs at highlighted loci (e.g. CD36). Notably, in Chapter 3, we found lipid-associated regulatory regions to be under genetic effects by GWAS SNPs linked to the same lipid traits under investigation (Global Lipids Genetics Consortium)2, including at the obesity-associated GALNT2 locus. Given the interest in using bioavailable tissues for disease detection and prediction in a clinical setting, we sought to further characterize trait-associated epigenetic variants shared across tissues. Through our collaboration with the IUCPQ, we gained access to matched whole blood samples for the same obese population and generated reference methylome datasets. Epigenetic signals that replicated from the disease-linked tissue to the bioavailable tissue showed distinct features, including a marked enrichment in genetic regulation (>93%)156.
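To make the notion of a CpG being "under genetic regulation" concrete, the following is a minimal sketch of a cis-SNP-CpG association test on toy data. It is not the pipeline used in Chapters 2 and 3: the data values, the names `dosage`, `cpgs` and `R_CUTOFF`, and the use of a plain correlation cut-off in place of a formal significance threshold are all assumptions made for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def slope(x, y):
    """Least-squares slope of y on x (methylation change per allele copy)."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    return sxy / sxx


# Toy data, invented for illustration: allele dosage (0, 1 or 2 copies)
# at one cis-SNP in six individuals.
dosage = [0, 0, 1, 1, 2, 2]

# Methylation fractions at two hypothetical CpGs in the same individuals.
cpgs = {
    "cpg_genetic": [0.20, 0.22, 0.40, 0.42, 0.60, 0.62],  # tracks genotype
    "cpg_other": [0.50, 0.30, 0.45, 0.35, 0.40, 0.40],    # does not
}

# Hypothetical cut-off; a real analysis would use p-values with
# multiple-testing correction rather than a raw correlation threshold.
R_CUTOFF = 0.8

under_genetic_effect = [
    name for name, meth in cpgs.items()
    if abs(pearson(dosage, meth)) >= R_CUTOFF
]
fraction = len(under_genetic_effect) / len(cpgs)
```

In this toy setting, `fraction` plays the role of the proportion of trait-linked sites with a detectable cis-genetic effect, and `slope` gives the per-allele change in methylation at a given CpG.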

A main strength of our studies was the generation of adipose-specific methylome, transcriptome, and chromatin accessibility epigenome maps. Using these and additionally available resources (e.g. histone marks from Roadmap) in integrational analyses permitted the annotation of our novel complex trait-linked epigenetic variants. In the first study, we were able to confirm that our hypomethylated footprints harbouring TG-linked CpGs predominantly overlapped adipocyte-specific open chromatin regions (>70%), indicating that adipocytes are an important cell fraction to consider in epigenomic studies of metabolic diseases. In our second study, we applied a three-way association approach of methylation, gene expression and lipid levels from the MuTHER cohort, which permitted us to identify putative target genes for 76% of lipid-linked regulatory regions. Reflecting current chromatin conformation reports25,162,163, we observed that a large fraction of trait-associated adipose elements showed stronger associations to non-proximal genes (average distance of ~522 kb), with most regions depicting putative pleiotropic effects (69%), including a TG-linked adipose enhancer region associating to cardiometabolic risk-associated loci GNA15 and GNG7. Importantly, we mainly focused our efforts on establishing the relationship between methylation and lipid levels at previously uninterrogated loci and not on deciphering the causal relationship between these measures. In all, we presented an expanded catalogue of epigenetic variants linked to cardiometabolic risk that is of interest for future investigations.
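The three-way association idea described above can be illustrated with a short sketch. This is a simplified reading of the approach and not the analysis performed in Chapter 3: it assumes that a gene is retained as a putative target when methylation, expression and the lipid trait are pairwise correlated above a cut-off. The function `three_way_targets`, the gene labels `GENE_A` and `GENE_B`, the cut-off and all data values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def three_way_targets(methylation, lipid, expression_by_gene, cutoff):
    """Return genes whose expression correlates with both the region's
    methylation and the lipid trait, provided that methylation itself
    correlates with the trait."""
    if abs(pearson(methylation, lipid)) < cutoff:
        return []
    hits = []
    for gene, expr in expression_by_gene.items():
        linked_to_region = abs(pearson(methylation, expr)) >= cutoff
        linked_to_trait = abs(pearson(expr, lipid)) >= cutoff
        if linked_to_region and linked_to_trait:
            hits.append(gene)
    return hits


# Toy data for six individuals, invented for illustration.
methylation = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60]  # one regulatory region
triglycerides = [1.0, 1.3, 1.4, 1.9, 2.1, 2.4]      # lipid trait
expression = {
    "GENE_A": [9.0, 8.1, 7.2, 6.4, 5.1, 4.3],  # falls as methylation rises
    "GENE_B": [5.0, 6.0, 5.0, 6.0, 5.0, 6.0],  # unrelated to either measure
}

# The cut-off is a placeholder; a real analysis would rely on formal
# significance testing and covariate adjustment.
targets = three_way_targets(methylation, triglycerides, expression, cutoff=0.8)
```

As stressed in the text, associations of this kind establish a relationship between the three measures but do not by themselves indicate its causal direction.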

200 CHAPTER 5: CONCLUSIONS AND FUTURE DIRECTIONS

Chapter 5: Conclusions and Future Directions

The aim of this doctoral thesis was to broaden our understanding of genetic and epigenetic variants underlying complex traits, focusing on metabolic diseases. In order to counter limitations posed by currently available methylome profiling techniques, we presented MCC-Seq as an alternative approach that enables simultaneous and accurate single-base resolution sequencing of methylation and genotypes at user-defined genomic targets. We performed a proof-of-concept study that validated the method as well as a follow-up study that supports the value of MCC-Seq in epigenome-wide investigations of complex traits. The customizable platform of MCC-Seq permits the interrogation of active regulatory regions within a tissue of interest. Since the release of MCC-Seq in 2015117, a total of ~4,500 whole blood and adipose tissue samples have been profiled by collaborators or independent groups across different cohorts of disease traits including asthma, T2D, eczema, rheumatoid arthritis, and stroke. In addition, other tissue panels have been designed based on our strategy for sequence selection, including a panel for human sperm. Multiple papers are currently in preparation or in revision at high-impact journals. These projects indicate that MCC-Seq is becoming accepted in the epigenomics field as a viable alternative for methylome investigation.

We focused our efforts on generating novel epigenome maps of the metabolic disease-linked visceral adipose tissue. Complementing current primary tissue and cell profiling endeavours, we contributed chromatin accessibility (ATAC-Seq), methylome, and transcriptome datasets of VAT and purified adipocytes. More precisely, using MCC-Seq, we generated methylation profiles for the largest cohort to date of this tissue type (N=199 VAT samples), enabling us to refine features of the active methylome landscape contributing to disease risk. Obtaining visceral adipose tissue involves an invasive surgical procedure, which limited our cohort size in the presented studies. We believe that applying MCC-Seq to larger multi-ethnic cohorts of biologically relevant tissues will result in increased power for the detection of complex trait-associated loci. Combined with replication in bioavailable tissues from longitudinal cohorts, MCC-Seq represents a high-powered tool with the potential to provide fully predictive functional biomarkers for clinical applications.

The studies pursued in this doctoral thesis are by no means exhaustive, and many future directions other than those stated above are of interest. For instance, although epigenetic signals are known to be cell-specific, most published epigenome-wide studies, including our own, have been conducted in heterogeneous tissues. While cell counts for whole blood samples may be accounted for in statistical models, they cannot be as easily generated for other primary tissue samples such as adipose tissue. The application of unsupervised deconvolution methods164 in future studies should improve current models and help identify cell types contributing to disease. Another interesting area of future exploration involves investigating the contribution of sex-specific factors in complex traits, which we were underpowered to pursue. Multiple studies have shown evidence for human sexual dimorphism in fat distribution, lipid metabolism and the onset of cardiometabolic diseases159-161. The impact of epigenetic variants in this context remains to be established. Furthermore, complementary studies of chromatin conformation and genome editing approaches in adipocytes would be a useful resource to further explore and confirm the genomic targets of lipid-linked regulatory regions presented in these studies. Efforts from our group are currently underway to provide reproducible 3D maps of such interactions in adipocytes differentiated from mesenchymal cells of healthy donors.
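The point made above about accounting for cell composition in statistical models can be illustrated with a partial correlation, in which both methylation and the trait are first regressed on an estimated cell-type fraction. This is a minimal sketch on invented data and does not reproduce the models used in this thesis; the variable names and values are hypothetical, and the cell fraction is simply assumed to be available (for example from cell counts or from a deconvolution method).

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def residuals(y, z):
    """Residuals of y after simple least-squares regression on z."""
    n = len(y)
    mz, my = sum(z) / n, sum(y) / n
    beta = (sum((a - mz) * (b - my) for a, b in zip(z, y))
            / sum((a - mz) ** 2 for a in z))
    return [b - my - beta * (a - mz) for a, b in zip(z, y)]


def partial_correlation(x, y, z):
    """Correlation between x and y after removing the linear effect of z."""
    return pearson(residuals(x, z), residuals(y, z))


# Toy data for six samples, invented for illustration.
cell_fraction = [0.1, 0.2, 0.3, 0.4, 0.5, 0.6]       # estimated proportion of one cell type
methylation = [0.26, 0.29, 0.36, 0.39, 0.46, 0.49]   # driven by cell fraction
trait = [1.3, 1.5, 1.5, 1.7, 2.0, 2.2]               # also driven by cell fraction

raw_r = pearson(methylation, trait)
adjusted_r = partial_correlation(methylation, trait, cell_fraction)
```

Here the unadjusted correlation is strong only because both measures follow the cell fraction; after adjustment little association remains, which is the type of confounding that cell-composition estimates are meant to control.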

In summary, we showed the power of using a novel next-generation approach to broaden our understanding of epigenetic variants contributing to complex traits by (1) designing custom panels that permitted the investigation of distal regulatory regions, which are currently underrepresented in other methylation profiling methods, and (2) focusing on a complex trait-linked tissue to provide added discovery power. We demonstrated that using this approach in combination with integrational studies using added epigenomic layers for the same tissue or cell type under investigation is a valid strategy to provide functional annotation of genetic and epigenetic variants linked to complex disease traits.

203 REFERENCES

References

1 Ritchie, M. D., Holzinger, E. R., Li, R., Pendergrass, S. A. & Kim, D. Methods of integrating data to uncover genotype–phenotype interactions. Nature Reviews Genetics 16, 85 (2015). 2 Willer, C. J. et al. Discovery and refinement of loci associated with lipid levels. Nature Genetics 45, 1274 (2013). 3 Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707 (2010). 4 Klarin, D. et al. Genetics of blood lipids among ~300,000 multi-ethnic participants of the Million Veteran Program. Nature Genetics 50, 1514-1523, doi:10.1038/s41588-018-0222-9 (2018). 5 Chasman, D. I. et al. Forty-three loci associated with plasma lipoprotein size, concentration, and cholesterol content in genome-wide analysis. PLoS Genetics 5, e1000730 (2009). 6 Albrechtsen, A. et al. Exome sequencing-driven discovery of coding polymorphisms associated with common metabolic phenotypes. Diabetologia 56, 298-310 (2013). 7 Peloso, G. M. et al. Association of low-frequency and rare coding-sequence variants with blood lipids and coronary heart disease in 56,000 whites and blacks. The American Journal of Human Genetics 94, 223-232 (2014). 8 Asselbergs, F. et al. Large-scale gene-centric meta-analysis across 32 studies identifies multiple lipid loci. The American Journal of Human Genetics 91, 823- 838 (2012). 9 WHO. Obesity and overweight, ( 10 González-Muniesa, P. et al. Obesity. Nature Reviews Disease Primers 3, 17034, doi:10.1038/nrdp.2017.34 (2017). 11 Expert Panel on Detection, E. Executive summary of the third report of the National Cholesterol Education Program (NCEP) expert panel on Detection, Evaluation, and Treatment of high blood cholesterol in adults (Adult Treatment Panel III). JAMA: the Journal of the American Medical Association 285, 2486 (2001). 12 Andersen, M. K. & Sandholt, C. H. Recent progress in the understanding of obesity: contributions of genome-wide association studies. Current Obesity Reports 4, 401-410 (2015). 
13 Hiuge-Shimizu, A. et al. Absolute value of visceral fat area measured on computed tomography scans and obesity-related cardiovascular risk factors in large-scale Japanese general population (the VACATION-J study). Annals of Medicine 44, 82-92 (2012). 14 Bhupathiraju, S. N. & Hu, F. B. Epidemiology of obesity and diabetes and their cardiovascular complications. Circulation Research 118, 1723-1735 (2016). 15 Stunkard, A. J., Harris, J. R., Pedersen, N. L. & McClearn, G. E. The body-mass index of twins who have been reared apart. New England Journal of Medicine 322, 1483-1487 (1990).


16 Allison, D. B. et al. The heritability of body mass index among an international sample of monozygotic twins reared apart. International Journal of Obesity Related Metabolic Disorders: Journal of the International Association for the Study of Obesity 20, 501-506 (1996). 17 Yang, J. et al. Genetic variance estimation with imputed variants finds negligible missing heritability for human height and body mass index. Nature Genetics 47, 1114-1120 (2015). 18 McCarthy, M. I. Genomics, type 2 diabetes, and obesity. New England Journal of Medicine 363, 2339-2350 (2010). 19 Frayling, T. M. et al. A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science 316, 889-894 (2007). 20 Locke, A. E. et al. Genetic studies of body mass index yield new insights for obesity biology. Nature 518, 197 (2015). 21 Consortium, t. G. et al. Meta-analysis of genome-wide association studies for height and body mass index in 700000 individuals of European ancestry. Human Molecular Genetics 27, 3641-3649 (2018). 22 Shungin, D. et al. New genetic loci link adipose and insulin biology to body fat distribution. Nature 518, 187 (2015). 23 MacArthur, J. et al. The new NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog). Nucleic Acids Research 45, D896-D901 (2016). 24 Frazer, K. A., Murray, S. S., Schork, N. J. & Topol, E. J. Human genetic variation and its contribution to complex traits. Nature Reviews Genetics 10, 241, doi:10.1038/nrg2554 (2009). 25 Consortium, E. P. An integrated encyclopedia of DNA elements in the human genome. Nature 489, 57-74 (2012). 26 Ernst, J. & Kellis, M. Discovery and characterization of chromatin states for systematic annotation of the human genome. Nature Biotechnology 28, 817-825, doi:10.1038/nbt.1662 (2010). 27 Kundaje, A. et al. Integrative analysis of 111 reference human epigenomes. Nature 518, 317 (2015). 28 Maurano, M. T. et al. 
Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190-1195 (2012). 29 Do, R., Kathiresan, S. & Abecasis, G. R. Exome sequencing and complex disease: practical aspects of rare variant association studies. Human Molecular Genetics 21, R1-R9 (2012). 30 Zuk, O. et al. Searching for missing heritability: designing rare variant association studies. Proceedings of the National Academy of Sciences 111, E455- E464 (2014). 31 Zuk, O., Hechter, E., Sunyaev, S. R. & Lander, E. S. The mystery of missing heritability: Genetic interactions create phantom heritability. Proceedings of the National Academy of Sciences 109, 1193-1198 (2012). 32 Romanoski, C. E., Glass, C. K., Stunnenberg, H. G., Wilson, L. & Almouzni, G. Epigenomics: Roadmap for regulation. Nature 518, 314 (2015).


33 Cheung, V. G. et al. Mapping determinants of human gene expression by regional and genome-wide association. Nature 437, 1365 (2005). 34 Stranger, B. E. et al. Population genomics of human gene expression. Nature Genetics 39, 1217 (2007). 35 Schadt, E. E. et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biology 6, e107 (2008). 36 Nica, A. C. et al. The architecture of gene regulatory variation across multiple human tissues: the MuTHER study. PLoS Genetics 7, e1002003 (2011). 37 Grundberg, E. et al. Mapping cis- and trans-regulatory effects across multiple tissues in twins. Nature Genetics 44, 1084-1089, doi:10.1038/ng.2394 (2012). 38 Chen, L. et al. Genetic drivers of epigenetic and transcriptional variation in human immune cells. Cell 167, 1398-1414. e1324 (2016). 39 GTEx, C. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science 348, 648-660 (2015). 40 Gamazon, E. R. et al. Using an atlas of gene regulation across 44 human tissues to inform complex disease-and trait-associated variation. Nature Genetics 50, 956 (2018). 41 GTEx, C. Genetic effects on gene expression across human tissues. Nature 550, 204 (2017). 42 Goldberg, A. D., Allis, C. D. & Bernstein, E. Epigenetics: a landscape takes shape. Cell 128, 635-638 (2007). 43 Waddington, C. H. Canalization of development and the inheritance of acquired characters. Nature 150, 563 (1942). 44 Tammen, S. A., Friso, S. & Choi, S.-W. Epigenetics: the link between nature and nurture. Molecular Aspects of Medicine 34, 753-764 (2013). 45 Fraga, M. F. et al. Epigenetic differences arise during the lifetime of monozygotic twins. Proceedings of the National Academy of Sciences 102, 10604-10609 (2005). 46 Group, I. S. M. W. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature 409, 928 (2001). 47 Collins, F. S., Morgan, M. & Patrinos, A. The Human Genome Project: lessons from large-scale biology. 
Science 300, 286-290 (2003). 48 Consortium, E. P. The ENCODE (ENCyclopedia of DNA elements) project. Science 306, 636-640 (2004). 49 Consortium, E. P. A user's guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology 9, e1001046 (2011). 50 Mapping the epigenome. Nature Methods 12, 161, doi:10.1038/nmeth.3315 (2015). 51 Martens, J. H. & Stunnenberg, H. G. BLUEPRINT: mapping human blood cell epigenomes. Haematologica 98, 1487-1489 (2013). 52 Edge, L. A Cornucopia of Advances in Human Epigenomics. Cell 167 (2016). 53 Stunnenberg, H. G. et al. The International Human Epigenome Consortium: a blueprint for scientific collaboration and discovery. Cell 167, 1145-1149 (2016). 54 Gross, D. S. & Garrard, W. T. Nuclease hypersensitive sites in chromatin. Annual Review of Biochemistry 57, 159-197 (1988). 55 Thurman, R. E. et al. The accessible chromatin landscape of the human genome. Nature 489, 75 (2012).


56 Maurano, M. T. et al. Systematic localization of common disease-associated variation in regulatory DNA. Science 337, 1190-1195, doi:10.1126/science.1222794 (2012). 57 Strahl, B. D. & Allis, C. D. The language of covalent histone modifications. Nature 403, 41 (2000). 58 Liang, G. et al. Distinct localization of histone H3 acetylation and H3-K4 methylation to the transcription start sites in the human genome. Proceedings of the National Academy of Sciences 101, 7357-7362 (2004). 59 Heintzman, N. D. et al. Distinct and predictive chromatin signatures of transcriptional promoters and enhancers in the human genome. Nature Genetics 39, 311-318, doi:10.1038/ng1966 (2007). 60 Stadler, M. B. et al. DNA-binding factors shape the mouse methylome at distal regulatory regions. Nature 480, 490-495, doi:10.1038/nature10716 (2011). 61 Burger, L., Gaidatzis, D., Schübeler, D. & Stadler, M. B. Identification of active regulatory regions from DNA methylation data. Nucleic Acids Research 41, e155- e155 (2013). 62 Banovich, N. E. et al. Methylation QTLs are associated with coordinated changes in transcription factor binding, histone modifications, and gene expression levels. PLoS Genetics 10, e1004663 (2014). 63 Kerkel, K. et al. Genomic surveys by methylation-sensitive SNP analysis identify sequence-dependent allele-specific DNA methylation. Nature Genetics 40, 904 (2008). 64 Brinkman, A. B. et al. Sequential ChIP-bisulfite sequencing enables direct genome-scale investigation of chromatin and DNA methylation cross-talk. Genome Research 22, 1128-1138 (2012). 65 Ziller, M. J. et al. Charting a dynamic DNA methylation landscape of the human genome. Nature 500, 477-481 (2013). 66 Grundberg, E. et al. Global Analysis of DNA Methylation Variation in Adipose Tissue from Twins Reveals Links to Disease-Associated Variants in Distal Regulatory Elements. The American Journal of Human Genetics 93, 876-890 (2013). 67 Jones, P. A. 
Functions of DNA methylation: islands, start sites, gene bodies and beyond. Nature Reviews Genetics 13, 484-492 (2012). 68 Schübeler, D. Function and information content of DNA methylation. Nature 517, 321 (2015). 69 Maunakea, A. K., Chepelev, I., Cui, K. & Zhao, K. Intragenic DNA methylation modulates alternative splicing by recruiting MeCP2 to promote exon recognition. Cell Research 23, 1256 (2013). 70 Maunakea, A. K. et al. Conserved role of intragenic DNA methylation in regulating alternative promoters. Nature 466, 253 (2010). 71 Sandoval, J. et al. Validation of a DNA methylation microarray for 450,000 CpG sites in the human genome. Epigenetics 6, 692-702 (2011). 72 Lienert, F. et al. Identification of genetic elements that autonomously determine DNA methylation states. Nature Genetics 43, 1091 (2011). 73 Feil, R. & Fraga, M. F. Epigenetics and the environment: emerging patterns and implications. Nature Reviews Genetics 13, 97-109 (2012).


74 Bergman, Y. & Cedar, H. DNA methylation dynamics in health and disease. Nature Structural & Molecular Biology 20, 274 (2013). 75 Smith, Z. D. & Meissner, A. DNA methylation: roles in mammalian development. Nature Reviews Genetics 14, 204 (2013). 76 Liu, Y. et al. Epigenome-wide association data implicate DNA methylation as an intermediary of genetic risk in rheumatoid arthritis. Nature Biotechnology 31, 142-147 (2013). 77 Dayeh, T. et al. Genome-wide DNA methylation analysis of human pancreatic islets from type 2 diabetic and non-diabetic donors identifies candidate genes that influence insulin secretion. PLoS Genetics 10, e1004160 (2014). 78 Dick, K. J. et al. DNA methylation and body-mass index: a genome-wide analysis. The Lancet (2014). 79 Jiang, C. et al. Disruption of hypoxia-inducible factor 1 in adipocytes improves insulin sensitivity and decreases adiposity in high-fat diet–fed mice. Diabetes 60, 2484-2495 (2011). 80 Zhang, H., Zhang, G., Gonzalez, F. J., Park, S.-m. & Cai, D. Hypoxia-inducible factor directs POMC gene to mediate hypothalamic glucose sensing and energy balance regulation. PLoS Biology 9, e1001112 (2011). 81 Hatanaka, M. et al. Hypoxia-inducible factor-3α functions as an accelerator of 3T3-L1 adipose differentiation. Biological & Pharmaceutical Bulletin 32, 1166-1172 (2009). 82 Irvin, M. R. et al. Epigenome-wide association study of fasting blood lipids in the genetics of lipid lowering drugs and diet network study. Circulation, CIRCULATIONAHA. 114.009158 (2014). 83 Pfeiffer, L. et al. DNA methylation of lipid-related genes affects blood lipid levels. Circulation: Cardiovascular Genetics, CIRCGENETICS. 114.000804 (2015). 84 Dekkers, K. F. et al. Blood lipids influence DNA methylation in circulating cells. Genome Biology 17, 138 (2016). 85 Sayols-Baixeras, S. et al. Identification and validation of seven new loci showing differential DNA methylation related to serum lipid profile: an epigenome-wide approach. The REGICOR study. 
Human Molecular Genetics 25, 4556-4565 (2016). 86 Hedman, Å. K. et al. Epigenetic patterns in blood associated with lipid traits predict incident coronary heart disease events and are enriched for results from genome-wide association studies. Circulation: Cardiovascular Genetics 10, e001487 (2017). 87 Marceau, P. et al. Duodenal switch improved standard biliopancreatic diversion: a retrospective study. Surgery for Obesity and Related Diseases 5, 43-47 (2009). 88 Breitling, L. P., Yang, R., Korn, B., Burwinkel, B. & Brenner, H. Tobacco- smoking-related differential DNA methylation: 27K discovery and replication. American Journal of Human Genetics 88, 450-457, doi:10.1016/j.ajhg.2011.03.003 (2011). 89 Wagner, J. R. et al. The relationship between DNA methylation, genetic and expression inter-individual variation in untransformed human fibroblasts. Genome Biology 15, R37 (2014).


90 Ernst, J. et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature 473, 43-49 (2011). 91 Hodges, E. et al. High definition profiling of mammalian DNA methylation by array capture and single molecule bisulfite sequencing. Genome Research 19, 1593-1605 (2009). 92 Li, H. & Durbin, R. Fast and accurate short read alignment with Burrows– Wheeler transform. Bioinformatics 25, 1754-1760 (2009). 93 Fortin, J.-P. et al. Functional normalization of 450k methylation array data improves replication in large cancer studies. bioRxiv (2014). 94 Moayyeri, A., Hammond, C. J., Hart, D. J. & Spector, T. D. The UK Adult Twin Registry (TwinsUK Resource). Twin Research and Human Genetics : the Official Journal of the International Society for Twin Studies, 1-6, doi:10.1017/thg.2012.89 (2012). 95 Andrew, T. et al. Are twins and singletons comparable? A study of disease- related and lifestyle characteristics in adult women. Twin Research : the Official Journal of the International Society for Twin Studies 4, 464-477 (2001). 96 Buenrostro, J. D., Giresi, P. G., Zaba, L. C., Chang, H. Y. & Greenleaf, W. J. Transposition of native chromatin for fast and sensitive epigenomic profiling of open chromatin, DNA-binding proteins and nucleosome position. Nature Methods (2013). 97 Silverstein, R. L. & Febbraio, M. CD36, a scavenger receptor involved in immunity, metabolism, angiogenesis, and behavior. Science Signaling 2, re3 (2009). 98 Love-Gregory, L. & Abumrad, N. A. CD36 genetics and the metabolic complications of obesity. Current Opinion in Clinical Nutrition and Metabolic care 14, 527 (2011). 99 Rać, M. E., Safranow, K. & Poncyljusz, W. Molecular basis of human CD36 gene mutations. Molecular Medicine 13, 288 (2007). 100 Knøsgaard, L., Thomsen, S., Støckel, M., Vestergaard, H. & Handberg, A. Circulating sCD36 is associated with unhealthy fat distribution and elevated circulating triglycerides in morbidly obese individuals. Nutrition & Diabetes 4, e114 (2014). 
101 Coram, M. A. et al. Genome-wide characterization of shared and distinct genetic components that influence blood lipid levels in ethnically diverse human populations. The American Journal of Human Genetics 92, 904-916 (2013). 102 Rakyan, V. K., Down, T. A., Balding, D. J. & Beck, S. Epigenome-wide association studies for common human diseases. Nature Reviews Genetics 12, 529-541 (2011). 103 Alkhatatbeh, M., Enjeti, A., Acharya, S., Thorne, R. & Lincz, L. The origin of circulating CD36 in type 2 diabetes. Nutrition & Diabetes 3, e59 (2013). 104 Shen, Y. et al. A map of the cis-regulatory sequences in the mouse genome. Nature 488, 116-120, doi:10.1038/nature11243 (2012). 105 Johnson, M. D., Mueller, M., Game, L. & Aitman, T. J. Single Nucleotide Analysis of Cytosine Methylation by Whole-Genome Shotgun Bisulfite Sequencing. Current Protocols in Molecular Biology, 21.23. 21-21.23. 28 (2012).


106 Marceau, P. et al. Biliopancreatic diversion with duodenal switch. World Journal of Surgery 22, 947-954 (1998). 107 Vohl, M. C. et al. A Survey of Genes Differentially Expressed in Subcutaneous and Visceral Adipose Tissue in Men. Obesity Research 12, 1217-1222 (2004). 108 Richterich, R. Zur bestimmung der plasmaglukosekonzentration mit der hexokinase-glucose-6-phosphat-dehydrogenase-method. Schweiz Med Wochenschr 101, 615-618 (1971). 109 Liu, Y., Siegmund, K. D., Laird, P. W. & Berman, B. P. Bis-SNP: Combined DNA methylation and SNP calling for Bisulfite-seq data. Genome Biology 13, R61 (2012). 110 Tchernof, A. et al. Regional differences in adipose tissue metabolism in women minor effect of obesity and body fat distribution. Diabetes 55, 1353-1360 (2006). 111 Lohse, M. et al. RobiNA: a user-friendly, integrated software solution for RNA- Seq-based transcriptomics. Nucleic Acids Research, gks540 (2012). 112 Feng, J., Liu, T., Qin, B., Zhang, Y. & Liu, X. S. Identifying ChIP-seq enrichment using MACS. Nature Protocols 7, 1728-1740 (2012). 113 Guenard, F. et al. Association of LIPA gene polymorphisms with obesity-related metabolic complications among severely obese patients. Obesity 20, 2075-2082, doi:10.1038/oby.2012.52 (2012). 114 Turcot, V. et al. LINE-1 methylation in visceral adipose tissue of severely obese individuals is associated with metabolic syndrome status and related phenotypes. Clinical Epigenetics 4, 10, doi:10.1186/1868-7083-4-10 (2012). 115 Trapnell, C., Pachter, L. & Salzberg, S. L. TopHat: discovering splice junctions with RNA-Seq. Bioinformatics 25, 1105-1111 (2009). 116 Langmead, B., Trapnell, C., Pop, M. & Salzberg, S. L. Ultrafast and memory- efficient alignment of short DNA sequences to the human genome. Genome Biology 10, R25, doi:10.1186/gb-2009-10-3-r25 (2009). 117 Allum, F. et al. Characterization of functional methylomes by next-generation capture sequencing identifies novel disease-associated variants. 
Nature Communications 6 (2015). 118 Kilpinen, H. & Dermitzakis, E. T. Genetic and epigenetic contribution to complex traits. Human Molecular Genetics 21, R24-R28 (2012). 119 Barres, R. & Zierath, J. R. DNA methylation in metabolic disorders. The American Journal of Clinical Nutrition 93, 897S-900S (2011). 120 Gluckman, P. D., Hanson, M. A., Buklijas, T., Low, F. M. & Beedle, A. S. Epigenetic mechanisms that underpin metabolic and cardiovascular diseases. Nature Reviews Endocrinology 5, 401-408 (2009). 121 Elder, S. J. et al. Genetic and environmental influences on factors associated with cardiovascular disease and the metabolic syndrome. Journal of Lipid Research 50, 1917-1926 (2009). 122 Mathers, J. C., Strathdee, G. & Relton, C. L. Induction of epigenetic alterations by dietary and other environmental factors. Advances in Genetics; Herceg, Z., Ushijima, T., Eds, 1-39 (2010). 123 Eichler, E. E. et al. Missing heritability and strategies for finding the underlying causes of complex disease. Nature Reviews Genetics 11, 446-450 (2010).


124 Wildman, R. P. et al. The obese without cardiometabolic risk factor clustering and the normal weight with cardiometabolic risk factor clustering: prevalence and correlates of 2 phenotypes among the US population (NHANES 1999-2004). Archives of Internal Medicine 168, 1617-1624 (2008).
125 Braun, K. V. et al. Epigenome-wide association study (EWAS) on lipids: the Rotterdam Study. Clinical Epigenetics 9, 15 (2017).
126 Cheung, W. A. et al. Functional variation in allelic methylomes underscores a strong genetic contribution and reveals novel epigenetic alterations in the human epigenome. Genome Biology 18, 50 (2017).
127 van Iterson, M., van Zwet, E. W. & Heijmans, B. T. Controlling bias and inflation in epigenome- and transcriptome-wide association studies using the empirical null distribution. Genome Biology 18, 19 (2017).
128 Busche, S. et al. Population whole-genome bisulfite sequencing across two tissues highlights the environment as the principal source of human methylome variation. Genome Biology 16, 290 (2015).
129 Richard, A. J. & Stephens, J. M. The role of JAK–STAT signaling in adipose tissue function. Biochimica et Biophysica Acta (BBA)-Molecular Basis of Disease 1842, 431-439 (2014).
130 Zhao, P. & Stephens, J. M. Identification of STAT target genes in adipocytes. Jak-Stat 2, e23092 (2013).
131 Stephens, J. M., Morrison, R. F., Wu, Z. & Farmer, S. R. PPARγ ligand-dependent induction of STAT1, STAT5A, and STAT5B during adipogenesis. Biochemical and Biophysical Research Communications 262, 216-222 (1999).
132 Kaltenecker, D. et al. Adipocyte STAT5 deficiency promotes adiposity and impairs lipid mobilisation in mice. Diabetologia 60, 296-305 (2017).
133 Priceman, S. J. et al. Regulation of adipose tissue T cell subsets by Stat3 is crucial for diet-induced obesity and insulin resistance. Proceedings of the National Academy of Sciences 110, 13079-13084 (2013).
134 Miura, S. et al. Nuclear factor 1 regulates adipose tissue-specific expression in the mouse GLUT4 gene. Biochemical and Biophysical Research Communications 325, 812-818 (2004).
135 Kadowaki, T. et al. in Cold Spring Harbor symposia on quantitative biology. 257-265 (Cold Spring Harbor Laboratory Press).
136 Hou, X. et al. CDK6 inhibits white to beige fat transition by suppressing RUNX1. Nature Communications 9, 1023 (2018).
137 Lu, F. & Liu, Q. Validation of RUNX1 as a potential target for treating circadian clock-induced obesity through preventing migration of group 3 innate lymphoid cells into intestine. Medical Hypotheses 113, 98-101 (2018).
138 Bell, J. T. et al. DNA methylation patterns associate with genetic and gene expression variation in HapMap cell lines. Genome Biology 12, R10, doi:10.1186/gb-2011-12-1-r10 (2011).
139 Smith, N. L. et al. Association of genome-wide variation with the risk of incident heart failure in adults of European and African ancestry: a prospective meta-analysis from the cohorts for heart and aging research in genomic epidemiology (CHARGE) consortium. Circulation: Genomic and Precision Medicine 3, 256-266 (2010).


140 Palmer, N. D. et al. Genetic variants associated with quantitative glucose homeostasis traits translate to type 2 diabetes in Mexican Americans: the GUARDIAN (Genetics Underlying Diabetes in Hispanics) Consortium. Diabetes 64, 1853-1866 (2015).
141 Divers, J. et al. Genome-wide association study of coronary artery calcified atherosclerotic plaque in African Americans with type 2 diabetes. BMC Genetics 18, 105 (2017).
142 Kristiansson, K. et al. Genome-Wide Screen for Metabolic Syndrome Susceptibility Loci Reveals Strong Lipid Gene Contribution But No Evidence for Common Genetic Basis for Clustering of Metabolic Syndrome Traits. Circulation: Genomic and Precision Medicine 5, 242-249 (2012).
143 Li, Q. et al. Association of the GALNT2 gene polymorphisms and several environmental factors with serum lipid levels in the Mulao and Han populations. Lipids in Health and Disease 10, 160 (2011).
144 Marucci, A. et al. GALNT2 expression is reduced in patients with Type 2 diabetes: possible role of hyperglycemia. PLoS One 8, e70159 (2013).
145 Newton-Cheh, C. et al. Genome-wide association study identifies eight loci associated with blood pressure. Nature Genetics 41, 666 (2009).
146 Demerath, E. W. et al. Epigenome-wide association study (EWAS) of BMI, BMI change and waist circumference in African American adults identifies multiple replicated loci. Human Molecular Genetics 24, 4464-4479 (2015).
147 Eisenstein, A. & Ravid, K. G Protein-Coupled Receptors and Adipogenesis: A Focus on Adenosine Receptors. Journal of Cellular Physiology 229, 414-421 (2014).
148 Liu, L. & Clipstone, N. A. Prostaglandin F2α inhibits adipocyte differentiation via a Gαq-Calcium-Calcineurin-Dependent signaling pathway. Journal of Cellular Biochemistry 100, 161-173 (2007).
149 Storey, J. D. & Tibshirani, R. Statistical significance for genomewide studies. Proceedings of the National Academy of Sciences 100, 9440-9445 (2003).
150 Wickham, H. ggplot2: elegant graphics for data analysis. (Springer, 2016).
151 Heinz, S. et al. Simple combinations of lineage-determining transcription factors prime cis-regulatory elements required for macrophage and B cell identities. Molecular Cell 38, 576-589 (2010).
152 Dobin, A. et al. STAR: ultrafast universal RNA-seq aligner. Bioinformatics 29, 15-21 (2013).
153 Love, M. I., Huber, W. & Anders, S. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology 15, 550 (2014).
154 Bates, D., Mächler, M., Bolker, B. & Walker, S. Fitting linear mixed-effects models using lme4. arXiv preprint arXiv:1406.5823 (2014).
155 Small, K. S. et al. Identification of an imprinted master trans regulator at the KLF14 locus related to multiple metabolic phenotypes. Nature Genetics 43, 561 (2011).
156 Allum, F. et al. Dissecting features of epigenetic variants underlying cardiometabolic risk using full-resolution epigenome profiling in regulatory elements. Nature Communications 10, 1209, doi:10.1038/s41467-019-09184-z (2019).


157 Varshney, A. et al. Genetic regulatory signatures underlying islet gene expression and type 2 diabetes. Proceedings of the National Academy of Sciences 114, 2301-2306 (2017).
158 Scott, L. J. et al. The genetic regulatory signature of type 2 diabetes in human skeletal muscle. Nature Communications 7, 11764 (2016).
159 Poulsen, P., Vaag, A., Kyvik, K. & Beck-Nielsen, H. Genetic versus environmental aetiology of the metabolic syndrome among male and female twins. Diabetologia 44, 537-543 (2001).
160 Varlamov, O., Bethea, C. L. & Roberts Jr, C. T. Sex-specific differences in lipid and glucose metabolism. Frontiers in Endocrinology 5, 241 (2015).
161 Lovejoy, J., Sainsbury, A. & Group, S. C. W. Sex differences in obesity and the regulation of energy homeostasis. Obesity Reviews 10, 154-167 (2009).
162 Sahlén, P. et al. Genome-wide mapping of promoter-anchored interactions with close to single-enhancer resolution. Genome Biology 16, 156 (2015).
163 Mifsud, B. et al. Mapping long-range promoter contacts in human cells with high-resolution capture Hi-C. Nature Genetics 47, 598 (2015).
164 Lutsik, P. et al. MeDeCom: discovery and quantification of latent components of heterogeneous methylomes. Genome Biology 18, 55 (2017).


Appendices

Appendix A: Significant contributions to other publications

1. Busche S, Shao X, Caron M, Kwan T, Allum F, Cheung WA, Ge B, Westfall S, Simon MM, The Multiple Tissue Expression Resource, Barrett A, Bell JT, McCarthy MI, Deloukas P, Blanchette M, Bourque G, Spector TD, Lathrop M, Pastinen T, Grundberg E. Population whole-genome bisulfite sequencing across two tissues highlights the environment as the principal source of human methylome variation. Genome Biology 16:290 (2015). doi: 10.1186/s13059-015-0856-1

Abstract

Background: CpG methylation variation is involved in human trait formation and disease susceptibility. Analyses within populations have been biased towards CpG-dense regions through the application of targeted arrays. We generate whole-genome bisulfite sequencing data for approximately 30 adipose and blood samples from monozygotic and dizygotic twins for the characterization of non-genetic and genetic effects at single-site resolution.

Results: Purely invariable CpGs display a bimodal distribution with enrichment of unmethylated CpGs and depletion of fully methylated CpGs in promoter and enhancer regions. Population-variable CpGs account for approximately 15–20 % of total CpGs per tissue, are enriched in enhancer-associated regions and depleted in promoters, and single nucleotide polymorphisms at CpGs are a frequent confounder of extreme methylation variation. Differential methylation is primarily non-genetic in origin, with non-shared environment accounting for most of the variance. These non-genetic effects are mainly tissue-specific. Tobacco smoking is associated with differential methylation in blood with no evidence of this exposure impacting cell counts. Opposite to non-genetic effects, genetic effects of CpG methylation are shared across tissues and thus limit inter-tissue epigenetic drift. CpH methylation is rare, and shows similar characteristics of variation patterns as CpGs.

Conclusions: Our study highlights the utility of low pass whole-genome bisulfite sequencing in identifying methylome variation beyond promoter regions, and suggests that targeting the population dynamic methylome of tissues requires assessment of understudied intergenic CpGs distal to gene promoters to reveal the full extent of inter-individual variation.

Description of contributions

For this work, I generated hypomethylated footprints from WGBS data for adipose and whole blood tissues, identifying tissue-shared and tissue-unique putative regulatory regions. I then performed downstream analyses to characterize these low-methylated (LMR) and unmethylated (UMR) regions, integrating additional epigenetic data available for these tissues from the Roadmap consortium. I contributed relevant text for the manuscript. The methylation footprints identified for this study were subsequently used in the two publications presented in the body of this work and were incorporated in a custom metabolic disease targeted panel commercialized by Roche NimbleGen as SeqCap Epi Developer XL Design #131010_HG19_EG_met_EPI.

2. Love-Gregory L, Kraja AT, Allum F, Aslibekyan S, Hedman AK, Duan Y, Borecki IB, Arnett DK, McCarthy MI, Deloukas P, Ordovas JM, Hopkins PN, Grundberg E, Abumrad NA. Higher chylomicron remnants and LDL particle numbers associate with CD36 SNPs and DNA methylation sites that reduce CD36. Journal of Lipid Research 57: 2176-2184 (2016). doi: 10.1194/jlr.P065250


Abstract

Cluster of differentiation 36 (CD36) variants influence fasting lipids and risk of metabolic syndrome, but their impact on postprandial lipids, an independent risk factor for cardiovascular disease, is unclear. We determined the effects of SNPs within a 410 kb region encompassing CD36 and its proximal and distal promoters on chylomicron (CM) remnants and LDL particles at fasting and at 3.5 and 6 h following a high-fat meal (Genetics of Lipid Lowering Drugs and Diet Network study, n = 1,117). Five promoter variants associated with CMs, four with delayed TG clearance and five with LDL particle number. To assess mechanisms underlying the associations, we queried expression quantitative trait loci, DNA methylation, and ChIP-seq datasets for adipose and heart tissues that function in postprandial lipid clearance. Several SNPs that associated with higher serum lipids correlated with lower adipose and heart CD36 mRNA and aligned to active motifs for PPARγ, a major CD36 regulator. The SNPs also associated with DNA methylation sites that related to reduced CD36 mRNA and higher serum lipids, but mixed model analyses indicated that the SNPs and methylation independently influence CD36 mRNA. The findings support contributions of CD36 SNPs that reduce adipose and heart CD36 RNA expression to inter-individual variability of postprandial lipid metabolism and document changes in CD36 DNA methylation that influence both CD36 expression and lipids.

Description of contributions

This study was a follow-up collaboration with the Abumrad group (Washington University, Missouri, USA) on the metabolically-linked CD36 locus after the release of our publication in 2015 (ref. 117; Chapter 2), where we focused on characterizing a putative lipid-linked enhancer region intragenic to this gene. Using ~1000 individuals from the GOLDN cohort, this group identified CD36 SNPs associated with lipid traits (i.e. chylomicron remnants, TG clearance and/or LDL particle number) and further with CD36 expression. I investigated these SNPs within the large MuTHER cohort (N~600) by performing integrational analyses using methylation, expression and genotype datasets. I found that two SNPs of interest were further associated with DNA methylation sites that in turn correlated with reduced CD36 mRNA and higher plasma lipid levels. By performing mixed-model analyses, I showed that both SNPs and methylation impact CD36 expression levels. During this year-long collaboration, I provided multiple analyses, including the ones described above. I provided text related to these analyses for inclusion in the manuscript and edited the final manuscript.

3. Cheung WA, Shao X, Morin A, Siroux V, Kwan T, Ge B, Aïssi D, Chen L, Vasquez L, Allum F, Guénard F, Bouzigon E, Simon MM, Boulier E, Redensek A, Watt S, Datta A, Clarke L, Flicek P, Mead D, Paul DS, Beck S, Bourque G, Lathrop M, Tchernof A, Vohl MC, Demenais F, Pin I, Downes K, Stunnenberg HG, Soranzo N, Pastinen T, Grundberg E. Functional variation in allelic methylomes underscores a strong genetic contribution and reveals novel epigenetic alterations in the human epigenome. Genome Biology 18:50 (2017). doi: 10.1186/s13059-017-1173-7

Abstract

Background: The functional impact of genetic variation has been extensively surveyed, revealing that genetic changes correlated to phenotypes lie mostly in non-coding genomic regions. Studies have linked allele-specific genetic changes to gene expression, DNA methylation, and histone marks but these investigations have only been carried out in a limited set of samples.

Results: We describe a large-scale coordinated study of allelic and non-allelic effects on DNA methylation, histone mark deposition, and gene expression, detecting the interrelations between epigenetic and functional features at unprecedented resolution. We use information from whole genome and targeted bisulfite sequencing from 910 samples to perform genotype-dependent analyses of allele-specific methylation (ASM) and non-allelic methylation (mQTL). In addition, we introduce a novel genotype-independent test to detect methylation imbalance between chromosomes. Of the ~2.2 million CpGs tested for ASM, mQTL, and genotype-independent effects, we identify ~32% as being genetically regulated (ASM or mQTL) and ~14% as being putatively epigenetically regulated. We also show that epigenetically driven effects are strongly enriched in repressed regions and near transcription start sites, whereas the genetically regulated CpGs are enriched in enhancers. Known imprinted regions are enriched among epigenetically regulated loci, but we also observe several novel genomic regions (e.g., HOX genes) as being epigenetically regulated. Finally, we use our ASM datasets for functional interpretation of disease-associated loci and show the advantage of utilizing naïve T cells for understanding autoimmune diseases.

Conclusions: Our rich catalogue of haploid methylomes across multiple tissues will allow validation of epigenome association studies and exploration of new biological models for allelic exclusion in the human genome.

Description of contributions

The method implemented in Chapter 2 (i.e. MCC-seq) was applied to 859 samples from the following tissue and cell sources: T cells (Cambridge NIHR BioResource), visceral adipose tissue (IUCPQ), whole blood (EGEA) and monocytes (Uppsala Bioresource). The visceral adipose tissue data were contributed from our previous publication (Chapter 2). The use of MCC-seq across these cohorts allowed for the study of the relationship between allele-specific methylation, gene expression levels and other layers of epigenetic marks. Additionally, I provided edits to the manuscript.


Appendix B: Copyright Permissions

The manuscripts and related materials reported in Chapters 2 and 3 were originally published in Nature Communications. As per the Nature journal policies on reprints and permissions: “The author of articles published by SpringerNature do not usually need to seek permission for the re-use of their material as long as the journal is credited with initial publication.” (https://www.nature.com/reprints/permission-requests.html). The original open access contents are available online:

Chapter 2: doi: 10.1038/ncomms8211
Chapter 3: doi: 10.1038/s41467-019-09184-z


Appendix C: Ethics and Related Certificates

All participants from the cohorts presented in these studies provided written informed consent before enrolment in the studies. Approvals by the ethics committees are presented in the individual manuscripts and reproduced here for the reader.

For the IUCPQ cohort in Chapters 2 and 3, the sample collection was approved by the Université Laval and McGill University (IRB FWA00004545) ethics committees and performed in accordance with the principles of the Declaration of Helsinki. Tissue banking and the severely obese cohort were approved by the research ethics committees of the Quebec Heart and Lung Institute.

For the CARTaGENE cohort in Chapter 3, the methylation studies of the samples from CARTaGENE were approved by the McGill University institutional review board, IRB number A04-M46-12B.
