INTEGRATIVE METHODS FOR THE ANALYSIS OF GENOME WIDE ASSOCIATION STUDIES

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Marc A. Schaub June 2012

© 2012 by Marc Andreas Schaub. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/qt820xd3631

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Atul Butte

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

David Dill

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Genome Wide Association Studies (GWAS) have identified over 4,500 common variants in the human genome that are statistically associated with diseases and other phenotypical traits. Most identified associations, however, only have a small effect on disease risk, and their relevance in a clinical setting remains the subject of extensive debate. In this thesis I present three integrative analysis directions that extend GWAS by developing new methods, by using genotyping data to ask new questions, and by integrating additional types of data to generate functional hypotheses about the biological processes underlying associations.

First, I introduce a new classifier-based methodology that identifies similarities in the genetic architecture of diseases. This method can successfully identify both known and novel relationships between common diseases such as type 1 diabetes, rheumatoid arthritis, hypertension and bipolar disease. Second, I show how control individuals from a GWAS can be used to detect genetic differences between the pseudoautosomal regions of X and Y in the general population, which can be attributed to differences in allele frequency between the two sex chromosomes likely caused by selective pressure. Finally, I present an approach that integrates experimental data generated by the ENCODE consortium in order to identify functional Single Nucleotide Polymorphisms (SNPs). These functional SNPs are associated with a phenotype, either directly or through linkage disequilibrium, and overlap a functional part of the genome such as a transcribed region or a binding site. GWAS associations are significantly enriched for functional annotations, and up to 80% of all associations previously reported in a GWAS can be mapped to a functional SNP. For most associations the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself.

Acknowledgment

I would like to thank my advisor Serafim Batzoglou for his advice, feedback, encouragements and support throughout my graduate career, for giving me the freedom to explore a broad range of research directions, and for bringing together such a truly outstanding group of researchers. In particular, I would like to thank George Asimenos and Chuong Do for all their highly valuable advice early in my graduate career, Anshul Kundaje for his advice and support during the second half of my thesis work, and Irene Kaplow for having been such a fantastic summer student.

I would like to thank Atul Butte for all the feedback and encouragement throughout my thesis work, for inviting me to his group meetings, for serving on my qualifying examination, defense and reading committees, and for giving me the privilege of collaborating closely with two amazing Ph.D. students in his group, Marina Sirota and Linda Liu. Working with Marina and Linda has certainly been the favorite part of my research work at Stanford, and I’m very grateful for everything they have done to make these joint projects such a successful and tremendously enriching experience.

I would like to thank Michael Snyder for giving me the opportunity to collaborate with his group on the analysis of the ENCODE data, for his advice and support, and for serving on my defense committee, Ross Hardison for his support of my work on linking ENCODE and GWAS data, David Dill for serving on my qualifying exam, defense and reading committees and Arend Sidow for chairing my thesis defense.

I would like to thank my friends and colleagues in the Batzoglou, Butte and Snyder labs and in the Stanford biomedical research community for their support, encouragements, advice, feedback, and the many fruitful discussions we had about research and life in general: Sarah Aerni, Andy Beck, Sivan Bercovici, Alan Boyle, Leticia Britos, David Chen, Rong Chen, Tiffany Chen, Annie Chiang, Erik Corona, Michelle Davison, Eugene Davydov, Omkar Deshpande, Joel Dudley, Robert Edgar, Megan Elmore, Sangeeta English, Patrick Flaherty, Jason Flannick, Chuan Sheng Foo, Eugene Fratkin, Yael Garten, Andrew Gentles, Sam Gross, Adam Grossman, Philip Guo, Manoj Hariharan, Lin Huang, Nadine Hussami, Robert Ikeda, Konrad Karczewski, Dorna Kashef-Haghighi, Peter Kang, Purvesh Khatri, Keiichi Kodama, Andy Kogelnik, Sofia Kyriazopoulou-Panagiotopoulou, Wei-Nchih Lee, Daniel Li, Li Li, Max Libbrecht, Irene Liu, Yuling Liu, Alex Morgan, Daniel Newburger, Tony Novak, Jon Palma, Chirag Patel, Victoria Popic, Yannick Pouliot, Dmitry Pushkarev, Jesse Rodriguez, Jon Rodriguez, David Ruau, Olga Russakovsky, Karen Sachs, Raheleh Salari, Nicelio Sanchez-Luege, Shai Shen-Orr, Andreas Sundquist, Silpa Suthram, Nick Tatonetti, Rob Tirrell, Shivkumar Venkatasubrahmanyam, Dan Webster and Noah Zimmerman.

My research work would not have been possible without the outstanding technical and administrative support of Miles Davis, Kathi DiTommaso, Sebastian Gutierrez, Alex Sandra Pinedo, Alex Skrenchuk, Tanya Raschke, Liliana Rivera and Verna Wong.

During my time at Stanford, I had the privilege of being involved in a broad range of extracurricular activities. I would like to thank all my friends in Stanford EMS, and in particular Florian Schmitzberger, Chris Cheung, Brian Cheung, Glenn Ulansey, Lauren Mamer, Mark Liao and James Liao, the teaching staff of the Stanford EMT program, the Stanford Wilderness Medicine instructor team, and the Escondido Village Community Associates for their friendship, encouragements and support throughout my graduate career, and for just being an amazing group of people! These programs tremendously enriched my experience at Stanford, and would not have been possible without the support of the Department of Public Safety, the Division of Emergency Medicine and Stanford Outdoor Education.

I would like to thank the Graduate Life Office, and in particular Ken Hsu, Laurette Beeson and Anne Boswell for their support of my work as a Community Associate in Studio 2, and all the great work they do to assist the Stanford graduate student community in general.

While many miles away, my friends and family in Switzerland, and in particular Frédéric Evéquoz, Grégory Mermoud and Grégory Théoduloz as well as my brother Alain have always been very supportive of my work.

Finally, none of this would have been possible without the unwavering support of my family throughout my entire career. I am deeply grateful to my father Andreas and my mother Margrith for everything they have done in order to give me the opportunity to follow my interests, and for always encouraging me to do so, even when it meant living nine time zones away from home. Danke viel, vielmals für alles!

Joint Work

Chapter 3 and Sections 2.1 and 2.2 of Chapter 2 are a reproduction, in part, of a previously published article: M.A. Schaub, I.M. Kaplow, M. Sirota, C.B. Do, A.J. Butte, S. Batzoglou. A Classifier-based Approach to Identify Genetic Similarities Between Diseases. Bioinformatics 25: i21-29. 2009. I would like to thank my co-authors Irene M. Kaplow, Marina Sirota, Chuong B. Do, Atul J. Butte and Serafim Batzoglou for their contributions to this project. I conceived and designed the study, performed all data preprocessing, implemented the version of the decision tree classifier used to obtain the reported results, analyzed the data and wrote the manuscript. Irene M. Kaplow performed exploratory research comparing various classifiers, which led to the choice of the decision tree classifier we used. Marina Sirota revised Figure 3.1, and designed the version shown herein. Chuong B. Do and Marina Sirota provided input and feedback on the study design and data analysis. Atul J. Butte and Serafim Batzoglou helped conceive the study and supervised the study. All authors revised the manuscript.

Chapter 4 and Section 2.3 of Chapter 2 represent joint work that will become part of a manuscript to be submitted after the time of submission of this thesis. I would like to thank my co-authors on this upcoming manuscript Linda Y. Liu, Marina Sirota, Serafim Batzoglou and Atul J. Butte for their contributions to this project. Linda Y. Liu and I jointly conceived and designed the study. Linda Y. Liu performed the analysis on the WTCCC data set. I performed the analysis on the HapMap 3 data set, developed the modified Hardy-Weinberg model, identified the sequence issue leading to false positives in autosomes, and wrote the chapter. Marina Sirota provided input and feedback on the study design and data analysis. Atul J. Butte helped conceive the study. Serafim Batzoglou and Atul J. Butte supervised the study.

Chapters 5 and 6 are a reproduction, in part, of a research article which, at the time of submission of this thesis, has been accepted for publication: M.A. Schaub, A.P. Boyle, A. Kundaje, S. Batzoglou, M.P. Snyder. Linking Disease Associations with Regulatory Information in the Human Genome. Genome Research. In press. I would like to thank my co-authors Alan P. Boyle, Anshul Kundaje, Serafim Batzoglou and Michael P. Snyder for their contributions to this project. I conceived and designed the study, performed all data analysis steps and wrote the manuscript. Alan P. Boyle designed RegulomeDB and provided regulatory annotations for the list of all SNPs used in this work. Anshul Kundaje, Serafim Batzoglou and Michael P. Snyder helped conceive the study and supervised the study. All authors revised the manuscript.

Funding

This work would not have been possible without the very generous donors who supported me through the Richard and Naomi Horowitz Stanford Graduate Fellowship, and a School of Engineering Fellowship. Parts of this work have been supported by the ENCODE consortium under Grant No. NIH 5U54 HG 004558, by the National Science Foundation under Grant No. 0640211, and by a King Abdullah University of Science and Technology research grant.

Data

This study makes use of data generated by the Wellcome Trust Case-Control Consortium. A full list of the investigators who contributed to the generation of the data is available from www.wtccc.org.uk. Funding for the project was provided by the Wellcome Trust under award 076113.

This work makes use of data generated and processed by the ENCODE consortium, the Office of Population Genomics at the National Human Genome Research Institute, the HapMap consortium, and the Genome Bioinformatics Group at the University of California Santa Cruz.

Contents

Abstract
Acknowledgment

1 Introduction
  1.1 Genome wide association studies
  1.2 Integrative analysis methods

2 Data Quality
  2.1 The Wellcome Trust Case Control Consortium data set
  2.2 Genotype calling artifacts
    2.2.1 Genotype calling
    2.2.2 Examples of artifacts
    2.2.3 Consensus approach
  2.3 Effect of homology with the sex chromosomes

3 Identifying Similarities Between Diseases
  3.1 Introduction
  3.2 Approach
  3.3 Results
    3.3.1 Classifier performance
    3.3.2 Disease similarities
    3.3.3 Differences between control sets
  3.4 Discussion
  3.5 Conclusion
  3.6 Methods
    3.6.1 Classification task
    3.6.2 Decision trees
    3.6.3 Identifying similarities

4 Analysis of the Pseudoautosomal Regions
  4.1 Introduction
  4.2 Results
    4.2.1 Significant differences between males and females in the pseudoautosomal region 1
    4.2.2 Replication
    4.2.3 Modified Hardy-Weinberg model
    4.2.4 Phasing
    4.2.5 Evolution of differing allele frequencies
  4.3 Discussion
  4.4 Conclusion
  4.5 Methods
    4.5.1 Identifying differences between males and females
    4.5.2 Trio-based phasing in PARs

5 Integrating Regulatory Information
  5.1 Introduction
  5.2 Results
    5.2.1 Lead SNP annotation
    5.2.2 Linkage disequilibrium
    5.2.3 Integrating expression data
    5.2.4 SNP comparison within linkage disequilibrium regions
    5.2.5 Associations are enriched for regulatory elements
    5.2.6 Analysis at the phenotype level
  5.3 Discussion
    5.3.1 Identifying functional SNPs in linkage disequilibrium with lead SNPs
    5.3.2 Comparison of functional assays
    5.3.3 Differences between tissue types
    5.3.4 Functional SNPs beyond reported associations
    5.3.5 Analysis at the phenotype level
  5.4 Conclusion
  5.5 Data
    5.5.1 GWAS catalog
    5.5.2 Linkage disequilibrium
    5.5.3 Genotyping arrays
    5.5.4 SNP properties
    5.5.5 Functional annotations
    5.5.6 Transcribed regions
  5.6 Methods
    5.6.1 Lead SNP annotation
    5.6.2 Linkage disequilibrium integration
    5.6.3 Randomization
    5.6.4 Analysis at the phenotype level

6 Analysis of Functional SNPs
  6.1 Strongly supported functional SNPs
  6.2 Replication of a previously validated functional SNP
  6.3 A new functional SNP for type 2 diabetes
    6.3.1 Results
    6.3.2 Discussion
  6.4 The 9p21 region in coronary artery disease
    6.4.1 Results
    6.4.2 Discussion
  6.5 Methods

Bibliography

List of Tables

2.1 Incorrect and correct genotype calls for rs2241572
2.2 Incorrect genotype calls for rs2491853

3.1 Classifier performance (cross-validation)
3.2 Separate training set classifier performance

4.1 Significant genotype differences between males and females in WTCCC controls
4.2 Analysis of rs312258 in WTCCC disease populations
4.3 Genotype counts for rs312258 in HapMap 3
4.4 Comparison of observed and estimated genotype counts
4.5 Allele counts per for rs312258 in HapMap 3
4.6 Phasing cases in trios

5.1 Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds.
5.2 Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds (European populations).
5.3 Comparison of functional evidence between the lead SNP and the best SNP in the linkage disequilibrium region.
5.4 Overview of enrichment: lead SNPs only, all populations.
5.5 Overview of enrichment: perfect LD, all populations.
5.6 Overview of enrichment: r² ≥ 0.9, all populations.
5.7 Overview of enrichment: r² ≥ 0.8, all populations.
5.8 Overview of enrichment: r² ≥ 0.5, all populations.
5.9 Overview of enrichment: lead SNPs only, European populations.
5.10 Overview of enrichment: perfect LD, European populations.
5.11 Overview of enrichment: r² ≥ 0.9, European populations.
5.12 Overview of enrichment: r² ≥ 0.8, European populations.
5.13 Overview of enrichment: r² ≥ 0.5, European populations.
5.14 Height-associated functional SNPs overlapping CTCF binding sites
5.15 Prostate cancer-associated functional SNPs overlapping AR binding sites
5.16 Modified RegulomeDB scoring scheme.

6.1 Overview of the lead SNPs most strongly supported by functional evidence.
6.2 Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in all populations.
6.3 Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in the European population.

List of Figures

1.1 Overview of a genome wide association study

2.1 Overview of genotype calling
2.2 Genotype calling for rs2241572
2.3 Genotyping chip signal intensities for rs312258 in controls
2.4 Effect of homology on genotype calling
2.5 Signal intensities for rs2491853

3.1 Overview of the approach.
3.2 Distribution of the disease-class probabilities for the type 1 diabetes classifier.
3.3 Disease-class probabilities comparisons.
3.4 Distribution of the class probabilities for the control-control classifier

4.1 Manhattan plot of differences between males and females in PAR1

5.1 Schematic overview of the functional SNP approach.
5.2 Proportions of associations for different types of functional data.
5.3 Enrichment for different combinations of assays.
5.4 Overview of enrichment.
5.5 Phenotype level overview of the overlap between associations and ChIP-seq binding.

6.1 Functional SNP rs7163757
6.2 Overview of the 9p21 region
6.3 Evidence supporting the implication of rs1333047 in coronary artery disease.

Chapter 1

Introduction

The completion of the landmark effort to sequence the human genome over ten years ago [1, 2] has had a profound impact on disease genetics research [3]. The availability of the full sequence of the human genome led to the identification of increasingly complete lists of common variation between individuals [4]. The most common form of variation is called a Single Nucleotide Polymorphism (SNP): a position in the genome where individuals differ at exactly one base pair, but the flanking sequence on both sides of the SNP is identical in the population [5]. For most SNPs, individuals have one of two base pairs: the major allele, which is the most prevalent in the population, and the minor allele. As humans are diploid, each individual will have two alleles at a given SNP: one on the maternal copy of the chromosome, and one on the paternal copy. These two alleles are combined to form the genotype of the individual at that SNP. SNPs for which the minor allele is relatively frequent (a commonly used threshold is 5% of the population) are called common SNPs.

The HapMap project [6, 7] led to the identification of over 3.8 million common SNPs. This large catalog of variation, combined with cost-effective genotyping technologies [8] that allow the measurement of hundreds of thousands of SNPs per individual, made it possible to study how populations differ at the genetic level. In the context of diseases, it became possible to compare a population of individuals who suffer from a given disease to the general population [9]. This idea forms the basis of Genome Wide Association Studies (GWAS). A landmark GWAS was published by the Wellcome Trust Case Control Consortium in 2007, and assessed 14,000 cases and 3,000 controls in order to identify variants associated with common diseases [10]. Since then, GWAS have led to the identification of thousands of loci associated with a large number of phenotypes [11, 12].

In this chapter I first show how a GWAS is performed, and then describe how my thesis builds on existing GWAS by developing new methods, asking new questions, and integrating GWAS results with functional data.

1.1 Genome wide association studies

The goal of a GWAS is to detect loci that have statistically significant differences in genotype frequencies between individuals who have a phenotype of interest and the general population. A commonly used study design is a case-control study, in which individuals who have a known phenotype (such as a disease) are recruited to be part of the case population, and matched healthy individuals are recruited to be part of the control population. GWAS have also been performed in longitudinal cohort studies in which a population has been followed over time [13]. Most recent GWAS have used genotyping platforms that allow the measurement of 500,000 to a million SNPs per sample, and can thus be used to detect many common variants associated with a phenotype across the entire genome. This makes GWAS particularly applicable to the study of common diseases under the assumption that a large number of common variants each contribute a small amount to the overall disease risk [14].

For each SNP, the numbers of individuals with each genotype are tallied separately in the cases and the controls. A statistical test is then applied to the genotype counts in order to assess whether there is a statistically significant difference between the genotype distribution in cases and in controls. A tutorial by Balding [15] summarizes statistical methods that are relevant to GWAS. As the number of SNPs assessed in a GWAS is very large, it is important to correct for multiple hypothesis testing. While multiple methods have been proposed, most studies use the conservative Bonferroni correction [16], which multiplies each P-value by the number of tested hypotheses. In this context the number of hypotheses is equal to the number of SNPs on the genotyping chip.

While a P-value indicates how much statistical support there is for a given association, it does not directly show how strong the effect is. An association with a weak effect that is tested in a very large population can obtain a much stronger P-value than an association with a stronger effect that is tested in a smaller population. Odds ratios can be computed in order to assess the effect size, but it is important to keep in mind that a large odds ratio does not necessarily mean that the association is significant. Figure 1.1 provides an overview of the approach used in order to assess the association between a SNP and a phenotype in a GWAS.

SNPs that are physically located in close proximity in the genome tend to be correlated with each other [17]. This phenomenon is a consequence of the evolution of the human genome and is called Linkage Disequilibrium (LD) [18]. Various metrics can be used to quantify linkage disequilibrium [19], including the squared correlation coefficient r². Figure 1.1 shows two perfectly correlated (r² = 1.0) SNPs, SNP2 and SNP3. Perfectly correlated SNPs are said to be in perfect linkage disequilibrium. SNPs can be grouped into haplotype blocks of loci that have strongly correlated genotypes. The HapMap project studied this haploblock structure in multiple populations [6, 7]. This information is used in the design of genotyping platforms in order to decide which SNPs in a haploblock need to be measured, and which SNPs can be inferred using information from other correlated SNPs. In the example of Figure 1.1, measuring SNP3 in addition to SNP2 would not provide any additional information. The SNP that is present on the genotyping platform is called the tag SNP. Imputation methods [20] can be used to obtain the likely genotypes of the individuals in the study at variants that were not assessed directly through genotyping, but whose correlation with tag SNPs is known from higher resolution data such as HapMap.
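As a minimal illustration of the squared correlation coefficient discussed above, the following Python sketch computes r² between two biallelic SNPs from phased haplotypes, using r² = D² / (p_A (1 - p_A) p_B (1 - p_B)) with D = p_AB - p_A p_B. The function name `r_squared` and the toy allele strings are assumptions made for this illustration (they are loosely modeled on the cartoon in Figure 1.1 and are not data from any study); an actual analysis would rely on established software and on reference panels such as HapMap.

```python
def r_squared(hap_a, hap_b):
    """Squared correlation r^2 between two biallelic SNPs.

    hap_a and hap_b list the allele carried by each phased chromosome
    at the first and second SNP, in the same chromosome order.
    """
    if len(hap_a) != len(hap_b) or len(hap_a) == 0:
        raise ValueError("both SNPs need the same, non-zero number of haplotypes")
    n = len(hap_a)
    ref_a, ref_b = hap_a[0], hap_b[0]  # arbitrary reference alleles
    p_a = sum(a == ref_a for a in hap_a) / n
    p_b = sum(b == ref_b for b in hap_b) / n
    p_ab = sum(a == ref_a and b == ref_b for a, b in zip(hap_a, hap_b)) / n
    d = p_ab - p_a * p_b  # linkage disequilibrium coefficient D
    denominator = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denominator == 0:
        raise ValueError("r^2 is undefined for a monomorphic SNP")
    return d * d / denominator


# Hypothetical alleles on 12 chromosomes (6 diploid individuals).
snp1 = list("GCCCGGCGGCGC")
snp2 = list("ATAAAATTTTAT")
snp3 = list("CGCCCCGGGGCG")  # C wherever snp2 is A, G wherever snp2 is T

print(r_squared(snp2, snp3))  # perfectly correlated pair
print(r_squared(snp1, snp2))  # weakly correlated pair
```

In this toy data set the second pair is only weakly correlated, so genotyping SNP1 would not allow SNP2 to be inferred reliably, whereas either SNP2 or SNP3 could serve as the tag SNP for the other.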
Therefore, while GWAS allow the identification of particular tag SNPs that are statistically associated with a phenotype of interest, these SNPs are often part of a larger region of linkage disequilibrium. Regions of strong linkage disequilibrium can be large, and SNPs associated with a phenotype have been found to be in perfect linkage disequilibrium with SNPs several hundred kilobases away. Linkage disequilibrium therefore makes it difficult to precisely pinpoint which SNPs play a functional role in the phenotype of interest, and which SNPs happen to be associated with a phenotype only because they are in linkage disequilibrium with another SNP that has a functional role. In Figure 1.1 both SNP2 and SNP3 are equally strongly associated with the phenotype since their genotypes are perfectly correlated.

While GWAS have led to the identification of a large number of variants associated with common diseases, several major challenges remain unaddressed [21]. First, interpreting GWAS results is difficult since most reported associations merely point to larger regions of correlated variants [22]. Furthermore, while GWAS provide a list of SNPs that are statistically associated with a phenotype of interest, they do not offer any direct evidence about the biological processes that link the associated variant to the phenotype. The fraction of associated loci found in GWAS that overlap known coding regions is relatively low. While many associated SNPs are located near known genes, strong associations have been found in so-called gene deserts [23, 24]. Second, individual variants often have a small effect size, and even all variants associated with a disease together only explain a small fraction of the disease risk [25]. Third, associations identified in one study may not be replicable in a different study [26], especially if the study is done in a population of different geographic origin.

[Figure 1.1: cartoon showing the two chromosome copies of three control individuals (:-)) and three case individuals (:-(), with SNP 1, SNP 2, SNP 3 and one rare variant marked; SNP 2 and SNP 3 are labeled as being in complete linkage disequilibrium.]

Genotype counts for SNP2:
             AA  AT  TT
Controls :-)  2   1   0
Cases    :-(  0   1   2
Summary statistics: P-value, odds ratio

Figure 1.1: Overview of a genome wide association study. In this cartoon example of a Genome Wide Association Study, three healthy control individuals are compared to three case individuals that have some disease phenotype of interest. Each individual has two copies of each chromosome. Three SNPs in which there is frequent variation in the population are shown. One rare variant, for which only one individual has a mutation, is also shown. Parts of the sequence that are identical in the entire population are in grey. SNP2 and SNP3 are perfectly correlated: if a chromosome contains the A allele at SNP2, it contains the C allele at SNP3, and if it contains the T allele at SNP2, it contains the G allele at SNP3. SNP2 and SNP3 are said to be in perfect linkage disequilibrium. The genotype counts for SNP2 are shown in a 2x3 table. Summary statistics for SNP2 can be computed based on this table.
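The 2x3 genotype table for SNP2 in Figure 1.1 can be used to sketch how such summary statistics are derived. The Python example below is a simplified illustration rather than the procedure used in any particular study: it assumes a Pearson chi-square test on the genotype table (one of several tests reviewed in [15]), uses the closed-form tail probability exp(-x/2) that holds for two degrees of freedom, applies a Bonferroni correction for a hypothetical chip with 500,000 SNPs, and computes an allelic odds ratio. All function names are hypothetical. With only six individuals the chi-square approximation is not reliable; the tiny counts are kept only so that the numbers can be checked by hand.

```python
import math


def chi_square_2x3(cases, controls):
    """Pearson chi-square statistic for a 2x3 table of genotype counts.

    Assumes every genotype is observed at least once, so that the test
    has (2 - 1) * (3 - 1) = 2 degrees of freedom.
    """
    col_totals = [a + b for a, b in zip(cases, controls)]
    if len(col_totals) != 3 or min(col_totals) == 0:
        raise ValueError("expected three genotype columns, all observed")
    total = sum(col_totals)
    stat = 0.0
    for row in (cases, controls):
        row_total = sum(row)
        for observed, col_total in zip(row, col_totals):
            expected = row_total * col_total / total
            stat += (observed - expected) ** 2 / expected
    return stat


def p_value_two_df(stat):
    """Upper tail of the chi-square distribution with 2 degrees of freedom."""
    return math.exp(-stat / 2.0)


def bonferroni(p_value, n_tests):
    """Bonferroni-corrected P-value, capped at 1."""
    return min(1.0, p_value * n_tests)


def allelic_odds_ratio(cases, controls):
    """Odds ratio of the second allele, from (hom1, het, hom2) counts.

    Assumes no allele count is zero (no continuity correction is applied).
    """
    def allele_counts(genotypes):
        hom1, het, hom2 = genotypes
        return 2 * hom1 + het, het + 2 * hom2

    case_a, case_t = allele_counts(cases)
    control_a, control_t = allele_counts(controls)
    return (case_t * control_a) / (case_a * control_t)


# Genotype counts (AA, AT, TT) for SNP2 in the cartoon of Figure 1.1.
controls = (2, 1, 0)
cases = (0, 1, 2)
n_snps = 500_000  # hypothetical number of SNPs on the genotyping chip

stat = chi_square_2x3(cases, controls)
p_raw = p_value_two_df(stat)
p_corrected = bonferroni(p_raw, n_snps)
odds_ratio = allelic_odds_ratio(cases, controls)
print(f"chi-square = {stat:.2f}, P = {p_raw:.3f}, "
      f"corrected P = {p_corrected:.3f}, odds ratio = {odds_ratio:.1f}")
```

In this toy example the odds ratio is large (25), yet the uncorrected P-value is about 0.14 and the corrected value is 1, which illustrates the point made in Section 1.1 that a large odds ratio does not by itself imply a significant association.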

1.2 Integrative analysis methods

In this thesis, I present three integrative approaches that extend genome wide association studies by developing new methods, asking new questions, and integrating new data. These approaches are applied to a wide variety of data sets, and provide insights into a broad range of complex human diseases.

Chapter 3 discusses a method for identifying similarities between diseases at the genetic level. This extends GWAS from both a methods and a questions perspective. Identifying genetic similarities between diseases is highly relevant for the translation of GWAS results to medicine: if two diseases share a common genetic architecture, then it is likely that the underlying disease processes are also similar. I develop new methods in order to achieve the goal of identifying disease similarities using information from multiple SNPs. By training a classifier that distinguishes cases and controls, a model of the disease architecture is learned. This is a significant improvement over methods that only consider similarities at the level of individual SNPs. The trained classifier is then applied to individuals that have a different disease, and by aggregating the classification of those individuals, we can estimate how close the genetic architectures of the two diseases are. While the performance of the classifier is insufficient to apply it to single individuals, it is able to identify significant relationships between diseases when aggregating predictions over a large number of individuals.

In Chapter 4, I re-purpose GWAS data in order to ask a new question. The genotype information of individuals used as controls in a GWAS is now used to identify differences between males and females. This approach allows me to identify significant differences in the pseudoautosomal regions of the sex chromosomes X and Y. Therefore data originally collected for the purpose of identifying variations linked to disease risk shine a new light onto the more fundamental biological question of the differences between males and females, in a particularly interesting, yet understudied region of the human genome.

In Chapter 5, I go back to the traditional associations identified using GWAS, but with the goal of adding to the understanding of the biological mechanism underlying individual associations. I integrate functional data about experimentally identified regulatory and transcribed regions together with GWAS results. Linkage disequilibrium information is used to study the entire regions associated with a phenotype. I show that functional hypotheses can be generated for a majority of previously identified associations. This is also an exercise in re-purposing data: the data sets used in this chapter were generated by ENCODE in order to help understand which regions of the human genome are functional, and how those functional aspects differ between cell lines. Integrating these data sets with GWAS results is, however, one of the most promising avenues for translating information about human gene regulation identified by the ENCODE consortium to the study of human disease in general.

While Chapter 6 mainly discusses specific examples of functional SNPs identified in Chapter 5, the analysis is also integrative. The analysis of the 9p21 region relies on using information in a different way than originally intended, as I build a population genetics argument based on a negative result, the lack of replication of a prominent association in a different population, in order to show how a functional SNP may play an important role in coronary artery disease.

Using GWAS data in an innovative way presents interesting challenges from a quality control perspective. Chapter 2 highlights two particular aspects of genotype calling that are directly relevant to the rest of the work presented herein.

Chapter 2

Data Quality

2.1 The Wellcome Trust Case Control Consortium data set

In Chapters 3 and 4 of this thesis, we use individual-level genotyping data provided by the Wellcome Trust Case Control Consortium (WTCCC). Authorization to use these data sets was obtained separately from the WTCCC for each project. The data sets we use come from a genome-wide association study [10] of seven common diseases: type 1 diabetes (T1D), type 2 diabetes (T2D), coronary artery disease (CAD), Crohn’s disease (CD), bipolar disease (BD), hypertension (HT), and rheumatoid arthritis (RA). The data consist of a total of 2,000 individuals per disease and 3,000 shared controls, with 1,500 control individuals from the 1958 British Birth Cohort (58C control set) and 1,500 individuals from blood donors recruited specifically for the project (UKBS control set). The genotyping of 500,568 SNPs per individual was performed using the Affymetrix GeneChip 500K Mapping Array Set.

In the original analysis of this data set by the WTCCC, a total of 809 individuals and 31,011 SNPs that did not pass quality control checks were excluded. In addition, SNPs that appeared to have a strong association in the original study were manually inspected for quality issues, and 578 additional SNPs were removed. In this work, we exclude all individuals and SNPs that were excluded in the WTCCC study, as well as an additional 9,881 SNPs that do not appear in the WTCCC summary results.

The use of genotyping data as input to a classifier (Chapter 3) and to study differences between males and females (Chapter 4) presents challenges that do not exist when performing a case-control study as done by the WTCCC. In this chapter we discuss specific data quality artifacts that could have affected our results, and how we addressed them.

2.2 Genotype calling artifacts

A major concern in the analysis of GWAS data is the possibility that reported genotypes for an individual could be incorrect. Genotype calling algorithms process the raw signal obtained from the genotyping chips in order to assign a genotype to each SNP and for each individual. Inaccuracies in genotype calling can lead to false positives in a GWAS [27]. While current algorithms are very accurate, a very large number of SNPs are analyzed in a GWAS. This means that even a very small error rate could lead to many false positives. If a genotype calling algorithm is accurate for 99.9% of all SNPs, then over 500 SNPs would still be incorrect for a study of the size of WTCCC. The purpose of the quality control steps performed by the WTCCC is to identify SNPs that have poor genotype quality, either due to poor raw data quality on the chip, or due to genotype calling errors.

Quality control steps need to carefully balance sensitivity and specificity. An overly sensitive approach would eliminate a large number of SNPs from the subsequent steps of the study. If a SNP that was correctly genotyped is eliminated at this stage, then a potentially significant association may be missed. It therefore makes sense to choose a more specific approach, and only discard SNPs that are very clearly of poor quality. This can be done without sacrificing sensitivity by adding an additional quality control step after the analysis has been performed. All SNPs identified to be significantly associated with the phenotype are inspected to identify any genotyping or genotype calling issue. As the number of significant associations is orders of magnitude lower than the number of SNPs on the genotyping platform, this step can be done manually. The WTCCC study uses this approach to ensure that none of the reported associations is a false positive due to a genotyping artifact.

In this section we briefly describe genotype calling, provide examples of artifacts, and then describe an alternative method that we need to use when applying a classifier to genotyping data.
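The error-rate argument made earlier in this section can be checked with a short calculation. The sketch below is illustrative only: the function name is hypothetical, and it assumes that the 99.9% accuracy figure applies per SNP across the 500,568 SNPs genotyped by the WTCCC.

```python
def expected_miscalled_snps(n_snps: int, accuracy: float) -> float:
    """Expected number of SNPs with incorrect calls, assuming the
    accuracy applies independently to each SNP (an assumption)."""
    return n_snps * (1.0 - accuracy)


# Number of SNPs genotyped per individual in the WTCCC study.
WTCCC_SNPS = 500_568

# Expected number of problematic SNPs at 99.9% per-SNP accuracy.
expected_errors = expected_miscalled_snps(WTCCC_SNPS, 0.999)
print(f"{expected_errors:.1f} SNPs expected to be miscalled")
```

Even at this accuracy, the expected number of problematic SNPs exceeds 500, which is far more than the number of associations typically reported by a GWAS.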

2.2.1 Genotype calling

A SNP assessed on a genotyping platform generally has two alleles, a major allele and a minor allele. The genotyping array contains multiple probes of the reverse complement of the sequence around each SNP. Some probes contain the reverse complement of the sequence including the major allele, and others contain the reverse complement of the sequence including the minor allele. The probe sequences are chosen in such a way that only the sequence near the SNP binds to them. Small fragments of the genome of the individual who is being genotyped then bind to those probes. If the individual is homozygous for one of the alleles, then the sequence fragments around the SNP will only bind to probes that have the reverse complement of the sequence including that allele. If the individual is heterozygous, then half the sequence fragments will contain the major allele, and half will contain the minor allele, and they will bind to the corresponding reverse complement probes. The amount of binding for both probes is then measured as a signal intensity. As binding affinities between the sequence and its reverse complement are variable, the genotyping calls are made for all individuals in the population at the same time. Signal intensities for both probes can be represented on a two-dimensional plot. Individuals that are homozygous will form clusters along each axis, whereas individuals that are heterozygous will form a cluster along the diagonal. Clustering algorithms are then used to assign each individual to one of three clusters, and thus determine its genotype for the SNP of interest. Figure 2.1 shows the appearance of a genotyping signal intensity plot.
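The clustering step described above can be illustrated with a deliberately simplified sketch. It is neither Chiamo nor BRLMM: it assumes that the two probe intensities are positive, reduces each individual to a single contrast value, and runs a one-dimensional k-means with three clusters over all individuals at once. The function name, the contrast transformation and the toy intensities are hypothetical choices made for this example.

```python
def call_genotypes(intensities, n_iter=20):
    """Assign AA, AB or BB to each (signal_a, signal_b) pair.

    Simplified stand-in for a genotype calling algorithm: 1-D k-means
    on the contrast (a - b) / (a + b), which is close to +1 for AA,
    0 for AB and -1 for BB. Assumes a + b > 0 for every individual.
    """
    contrasts = [(a - b) / (a + b) for a, b in intensities]
    labels = ["AA", "AB", "BB"]
    centers = [1.0, 0.0, -1.0]  # initial cluster centers

    def nearest(value):
        return min(range(3), key=lambda k: abs(value - centers[k]))

    for _ in range(n_iter):
        groups = [[], [], []]
        for value in contrasts:
            groups[nearest(value)].append(value)
        # Keep the previous center when a cluster is empty.
        centers = [
            sum(group) / len(group) if group else centers[k]
            for k, group in enumerate(groups)
        ]
    return [labels[nearest(value)] for value in contrasts]


# Toy intensities: two individuals per expected genotype cluster.
toy_intensities = [
    (1.9, 0.1), (2.0, 0.2),   # mostly signal for allele A
    (1.0, 1.1), (0.9, 1.0),   # similar signal for both alleles
    (0.1, 1.8), (0.2, 2.1),   # mostly signal for allele B
]
toy_calls = call_genotypes(toy_intensities)
print(toy_calls)
```

Because the cluster centers are estimated from all individuals jointly, the sketch mirrors the reason given above for calling the whole population at the same time: the position of each cluster depends on probe-specific binding affinities.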

2.2.2 Examples of artifacts

An example of a SNP in which an error in genotype calling in the original WTCCC data leads to a false positive is rs2241572. The original genotyping counts made by the Chiamo algorithm [28] used by WTCCC are shown in Table 2.1. Computing a P-value based on these counts leads to an extremely significant association with coronary artery disease. This association was flagged as a false positive by the WTCCC after inspection of the genotype intensity plots (Figure 2.2). It is interesting to note that a cause of the specific genotype calling error shown here is that WTCCC performed genotype calling separately for each population, and the algorithm made cluster-to-genotype assignment decisions that were inconsistent between populations. This issue would likely have been partially avoided if genotype calling had been performed jointly for the cases and the controls. An additional issue comes from the fact that the algorithm did not identify the small number of individuals with the GG genotype as a cluster, and thus incorrectly assigned this genotype to other clusters. A second genotyping algorithm (the standard Affymetrix algorithm BRLMM) correctly calls the genotypes for this SNP (Table 2.1), and there is no significant difference between cases and controls.

[Figure 2.1 plot: x-axis: Normalized Signal for Allele T; y-axis: Normalized Signal for Allele A; clusters labeled Homozygote AA, Heterozygote AT and Homozygote TT]

Figure 2.1: Overview of genotype calling. Each X represents the genotype of an individual. The arrows indicate the direction along which individuals with each genotype will cluster. Dashed lines represent the clusters that a genotype calling algorithm should identify for this example.

Chiamo                       GG     GC     CC
Controls                   2924      0      0
Coronary Artery Disease     240      0   1658

BRLMM                        GG     GC     CC
Controls                      9    335   2580
Coronary Artery Disease       8    242   1638

Table 2.1: Incorrect and correct genotype calls for rs2241572
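To make the contrast between the two sets of calls in Table 2.1 concrete, the following sketch computes a P-value from each set of genotype counts. The text does not state which test was used, so the Pearson chi-square test of independence is an assumption, as are the function and variable names. Genotype columns that are empty in both groups are dropped, and only tables with one or two degrees of freedom are supported, for which the chi-square tail probability has a closed form.

```python
import math


def chi_square_test(table):
    """Pearson chi-square test of independence on a table of counts.

    Returns (statistic, degrees of freedom, P-value). Columns whose
    total is zero are dropped before the test.
    """
    keep = [j for j in range(len(table[0])) if sum(row[j] for row in table) > 0]
    table = [[row[j] for j in keep] for row in table]
    row_totals = [sum(row) for row in table]
    col_totals = [sum(row[j] for row in table) for j in range(len(keep))]
    total = sum(row_totals)

    statistic = 0.0
    for i, row in enumerate(table):
        for j, observed in enumerate(row):
            expected = row_totals[i] * col_totals[j] / total
            statistic += (observed - expected) ** 2 / expected

    df = (len(table) - 1) * (len(keep) - 1)
    if df == 1:
        p_value = math.erfc(math.sqrt(statistic / 2.0))
    elif df == 2:
        p_value = math.exp(-statistic / 2.0)
    else:
        raise ValueError("this sketch only supports 1 or 2 degrees of freedom")
    return statistic, df, p_value


# Genotype counts (GG, GC, CC) from Table 2.1: controls, then CAD cases.
chiamo_counts = [[2924, 0, 0], [240, 0, 1658]]
brlmm_counts = [[9, 335, 2580], [8, 242, 1638]]

chiamo_stat, chiamo_df, chiamo_p = chi_square_test(chiamo_counts)
brlmm_stat, brlmm_df, brlmm_p = chi_square_test(brlmm_counts)
print(f"Chiamo: chi2 = {chiamo_stat:.1f}, df = {chiamo_df}, P = {chiamo_p:.3g}")
print(f"BRLMM:  chi2 = {brlmm_stat:.2f}, df = {brlmm_df}, P = {brlmm_p:.3g}")
```

With the Chiamo counts the statistic is so large that the P-value may underflow to zero in double precision, whereas the BRLMM counts give a P-value well above conventional significance thresholds.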

Figure 2.2: Genotype calling for rs2241572. In the control sets (58C and NBS), the genotype calling algorithm clusters both individuals homozygous for the C allele and heterozygous individuals into a single cluster (red). This cluster is assigned the GG genotype. In the cases (CAD), the clustering algorithm finds two clusters, which are assigned the GG genotype (red) and the CC genotype (blue).

2.2.3 Consensus approach

In the WTCCC study, visual inspection of signal intensities is done after the analysis, which makes it possible to manually inspect the small subset of SNPs that are potentially significant. In a classifier-based approach, it is impractical to perform any kind of visual inspection, and we must try to minimize the errors due to genotype calling prior to the analysis. It is intractable to manually inspect all SNPs prior to training the classifier. If we choose to inspect SNPs that are used as features by the classifier after training, then the whole classifier needs to be re-trained every time a feature must be discarded due to poor data quality. Such an iterative approach is also slow, and only works if a small number of features are used (which is the case in a decision tree). In order to be able to use any classifier, we developed an approach that combines the independent genotype calls of different algorithms in order to lower the risk of genotyping artifacts. While additional genotype calling algorithms were available at the time of our study [29], we did not have access to the raw genotyping chip signal intensity data that would have been necessary to use them.

The WTCCC study only uses genotype calls made by a custom algorithm, Chiamo [28], but the genotype calls made using the standard Affymetrix algorithm BRLMM are also available. While the study does show that Chiamo has, on average, a lower error rate than BRLMM, there are SNPs that are discarded during the quality control process that show errors in the genotype calls made by Chiamo (such as the example shown in Section 2.2.2). We use the two genotype sets to create a consensus data set in which the genotype of a given individual at a given SNP is used only if there is agreement between the call made by Chiamo and the call made by BRLMM, and is considered to be unknown if the calls are different. This approach individually considers the calls made for every individual at every SNP, and does not discard entire SNPs. The handling of SNPs that have a high proportion of unknown genotypes is left to the classification algorithm, and is discussed in Section 3.6.2. While this approach does reduce the errors in genotype calling, this comes at the cost of discarding calls for which Chiamo is right but BRLMM is not. Overall, the frequency of unknown genotypes is 2% using the consensus approach, compared to 0.65% using Chiamo and 0.74% using BRLMM. Furthermore, BRLMM genotype calls are entirely missing for a total of 184 individuals, who are thus excluded from our study.
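The consensus rule described above can be sketched in a few lines. The data layout (nested dictionaries indexed by individual and SNP), the use of None for an unknown genotype, and all names and identifiers are assumptions made for this illustration, and they are not the format of the WTCCC files. The sketch also assumes that a call missing from either algorithm yields an unknown consensus genotype.

```python
def consensus_call(chiamo_call, brlmm_call):
    """Keep a genotype only if both algorithms made the same call."""
    if chiamo_call is None or brlmm_call is None:
        return None  # assumption: a missing call counts as disagreement
    return chiamo_call if chiamo_call == brlmm_call else None


def build_consensus(chiamo, brlmm):
    """Combine two call sets given as {individual: {snp: genotype}}.

    Individuals with no BRLMM calls at all are left out of the result.
    """
    consensus = {}
    for individual, calls in chiamo.items():
        if individual not in brlmm:
            continue
        other = brlmm[individual]
        consensus[individual] = {
            snp: consensus_call(call, other.get(snp))
            for snp, call in calls.items()
        }
    return consensus


# Hypothetical call sets for three individuals and two SNPs.
chiamo_calls = {
    "ind1": {"snp1": "GG", "snp2": "AT"},
    "ind2": {"snp1": "GG", "snp2": "TT"},
    "ind3": {"snp1": "CC", "snp2": "AA"},
}
brlmm_calls = {
    "ind1": {"snp1": "GG", "snp2": "AT"},
    "ind2": {"snp1": "GC", "snp2": "TT"},
}
consensus = build_consensus(chiamo_calls, brlmm_calls)
print(consensus)
```

Because disagreements are handled per individual and per SNP, a SNP is never discarded as a whole. It simply accumulates unknown genotypes, which the classifier must then be able to handle.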
After performing these pre-processing steps, the data set used in this work consists of 459,075 SNPs measured in 2938 control individuals (58C: 1480, UKBS: 1458), 1963 individuals with type 1 diabetes, 1916 individuals with type 2 diabetes, 1882 individuals with coronary artery disease, 1698 individuals with Crohn’s disease, 1819 individuals with bipolar disease, 1952 individuals with hypertension and 1834 individuals with rheumatoid arthritis.

[Figure 2.3 plots: rs312258 in the 58C (left) and NBS (right) control sets; x-axis: Signal for Allele A; y-axis: Signal for Allele C]

Figure 2.3: Genotyping chip signal intensities for rs312258 in controls

2.3 Effect of homology with the sex chromosomes

We perform visual inspection of all SNPs that show significant differences in genotype frequency between males and females that were identified using the approach described in Chapter 4. The SNPs in PAR1 that exhibit significant differences do not show any sign of genotype calling artifact. Figure 2.3 shows the signal intensity plots for the most significant SNP, rs312258. We observe that the significant differences identified on autosomes are the result of bad genotype calling. Furthermore, a difference in signal intensity between males and females can often be observed for one allele but not the other. We look for sequences that are homologous to the sequence around SNPs for which we identify this issue. We find that in such cases the sequence containing one of the alleles is homologous to sequence on either the X or the Y chromosome. Such homology can adversely affect genotyping.

Figure 2.4 shows the effect of homology on the signal intensity observed on the genotyping chip. In the absence of any homologous sequence, the entirety of the signal is caused by binding of sequences around the SNP to the respective probes (Figure 2.4A). In this case the three genotype clusters can easily be distinguished. If there is homology for the sequence that includes the A allele of the SNP, but not for the sequence that includes the T allele, then the homologous sequence will also bind to the probe for the A allele. As this binding is independent of which allele the individual has for the SNP of interest, it results in an overall increase of the observed intensity for the probe with the A allele, but not for the probe with the T allele (Figure 2.4B). Note that while the absolute intensities are shifted, the normalized intensities would look similar to the case in which there is no homology. Homology with a sequence on an autosome would therefore impact the spread of the signal intensity, but genotype clusters should remain correctly separable. If there is homology with the Y chromosome, then only the intensity of males will be shifted upwards. This leads to the superposition of clusters shown in Figure 2.4C. Running a genotype calling algorithm on such an example will likely result in all heterozygotes as well as males with the TT genotype being clustered together. This incorrect genotyping will result in a significant difference between male and female genotype distributions. A similar difference in shift between males and females will exist if the homologous sequence is on the X chromosome, since females will have two copies of the homologous sequence, and males only one.

[Figure 2.4 panels: A. No homology; B. Homology for sequence including the A allele on an autosome; C. Homology for sequence including the A allele on the Y chromosome; axes: Normalized Signal for Allele A and Normalized Signal for Allele T]

Figure 2.4: Effect of homology on genotype calling. The green, blue and red ovals represent the areas where individuals with respectively AA, AT and TT genotypes will cluster. Clusters of male individuals are represented with dashed borders in panel C.

            CC     CG     GG
Male      1114    363      0
Female    1060    342     34

Table 2.2: Incorrect genotype calls for rs2491853

A concrete example of this situation is rs2491853, a SNP located on chromosome 1. The sequence around the SNP with the C allele at rs2491853 is homologous with a segment of sequence on chromosome Y. Figure 2.5C shows the probe sequences, which are the reverse complement of a segment of sequence that is present on chromosomes 1 and Y. Table 2.2 shows the incorrect genotype counts obtained when running Chiamo on the original signal intensities. No male individual is assigned the GG genotype. This is consistent with a shift of signal intensities for the probe binding to the sequence containing the C allele, which would be expected based on the homology of this sequence with the Y chromosome. Figure 2.5A shows the raw signal intensities for this SNP. The intensities of male individuals are shifted vertically. When scaling intensities per sex (Figure 2.5B), the intensities for males and females overlap, and it becomes visible that there are indeed male individuals with a GG genotype.

We assess the extent to which SNPs on the Affymetrix 500K genotyping chip are potentially subject to this homology artifact. We intersect the list of all SNPs on the array with the Human Chained Self Alignments track of the UCSC browser [30]. This track uses a method originally developed to compare the human and mouse genomes [31] in order to align the human genome with itself, and therefore identify homologous regions of the human genome. We identify a total of 854 SNPs on the genotyping chip that are in a region that is homologous to a region of the X chromosome, 723 SNPs that are in a region that is homologous to a region of the Y chromosome, and 54 SNPs that are homologous to regions on both X and Y. These results illustrate that while this artifact is not limited to the case we discuss herein, it affects less than 1% of the genotyped SNPs. In this work, we identify SNPs that show differences between males and females due to homology, and discard them as false positives.
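The per-sex scaling mentioned above can be sketched as follows. The text does not specify the scaling that was used, so standardizing each probe intensity within each sex (subtracting the mean and dividing by the standard deviation) is an assumption, as are the function name and the toy values. This choice implicitly assumes that genotype frequencies are similar in males and females.

```python
import statistics


def rescale_per_sex(intensities, sexes):
    """Standardize one probe's intensities separately within each sex.

    intensities: list of floats, one per individual.
    sexes: list of labels (for example "M" or "F"), same length.
    """
    scaled = [0.0] * len(intensities)
    for sex in set(sexes):
        index = [i for i, label in enumerate(sexes) if label == sex]
        values = [intensities[i] for i in index]
        mean = statistics.mean(values)
        spread = statistics.pstdev(values)
        for i in index:
            # A constant group carries no information after centering.
            scaled[i] = (intensities[i] - mean) / spread if spread > 0 else 0.0
    return scaled


# Toy intensities for the shifted probe: male values are the female
# values plus a constant offset, as expected if a homologous sequence
# on the Y chromosome also binds to the probe.
raw_intensity = [1.0, 0.5, 0.1, 1.6, 1.1, 0.7]
sex_labels = ["F", "F", "F", "M", "M", "M"]
scaled_intensity = rescale_per_sex(raw_intensity, sex_labels)
print([round(value, 3) for value in scaled_intensity])
```

After rescaling, the male and female values in this toy example coincide, so that a single run of a genotype calling algorithm over all individuals no longer sees a sex-specific shift.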
It is important to note that this issue is not limited to the specific study of genotype differences between males and females. Genotype calls will likely be incorrect for both cases and controls whenever there is homology between the region around a SNP and a region on a sex chromosome. As the error is likely to affect one sex disproportionately (for example males in rs2491853), any difference in sex proportions between the case and control sets could lead to a difference in observed genotype frequency between cases and controls. Because of this artifact, sex can become a hidden variable. Hidden variables can lead to false positives even when there is no effect of sex on actual genotype frequencies and no association between the SNP and the disease in either sex. Every GWAS should therefore include a step in which this issue is addressed for all SNPs in regions homologous to a sex chromosome. Running a genotype calling algorithm separately on males and females could potentially lead to issues similar to those discussed in Section 2.2.2. It is therefore recommended to first rescale the intensity for each sex, and then run the genotype calling algorithm on the rescaled intensities of all individuals.

[Figure 2.5 panels: A. Raw intensities (rs2491853 controls; x-axis: Signal for Allele G; y-axis: Signal for Allele C); B. Scaled per sex (x-axis: Normalized Intensity A; y-axis: Normalized Intensity B); C. Probe sequences and genomic sequences, listed below]

Probe sequences:
AAACAAGAGGGACTGAGGTGAAGGT
AAACAAGAGGCACTGAGGTGAAGGT
ACAAGAGGCACTGAGGTGAAGGTTT
ACAAGAGGGACTGAGGTGAAGGTTT
AAAAACAAGAGGCACTGAGGTGAAG
AAAAACAAGAGGGACTGAGGTGAAG
ACAAAAAACAAGAGGGACTGAGGTG
ACAAAAAACAAGAGGCACTGAGGTG

chr1:243,090,399-243,090,437 TCTTGTTTTTTGTTCTCCCTGACTCCACTTCCAAATA
chrY:13,946,923-13,946,961 TCTTGTTTTTTGTTCTCCCTGACTCCACTTCCAAATA

Figure 2.5: Signal intensities for rs2491853. In panels A and B, males are represented in red and females in blue. Boxes indicate individuals for which the algorithm did not make a genotype call. Panel C represents the probes assessing this SNP on the Affymetrix chip, and the sequence surrounding the SNP on chromosome 1. The sequence containing the C allele at rs2491853 is homologous with chromosome Y.

Chapter 3

Identifying Similarities Between Diseases

3.1 Introduction

Genome-wide Association Studies (GWAS) allow the identification of associations between genotype and phenotype. The Wellcome Trust Case Control Consortium (WTCCC) genotyped 500,000 SNPs in seven common diseases: type 1 diabetes (T1D), type 2 diabetes (T2D), coronary artery disease (CAD), Crohn’s disease (CD), bipolar disease (BD), hypertension (HT) and rheumatoid arthritis (RA) [10]. In this chapter we use the individual genotype data from this study in order to identify similarities between the genetic architecture of diseases.

Computational methods have been used to identify disease similarities using a variety of data sources, including gene expression in cancer [32] and known relationships between mutations and phenotypes [33]. However, while a large number of GWAS focusing on individual diseases have been recently published, the attempts to integrate the results of multiple studies have been limited. Most of these integration approaches focus on combining multiple studies of the same disease in order to increase the statistical power [34], or use data from other high-throughput measurement modalities to improve the results of GWAS [35]. Comparisons between the genetic components of diseases have been made using four different approaches.


The first approach is based on the identification of the association between one SNP in two different diseases in two independent studies. The second approach selects a group of SNPs that have been previously associated with some disease and tests if they are also associated with a different disease. An example of this approach is the genotyping of a large number of individuals with type 1 diabetes at 17 SNPs that have been associated with other autoimmune diseases, which leads to the identification of a locus previously associated with only rheumatoid arthritis as being significantly associated with type 1 diabetes as well [36]. The third approach pools data from individuals with several diseases prior to the statistical analysis, and has been used in the original WTCCC study. Several similar diseases (autoimmune diseases, metabolic and cardiovascular diseases) are grouped in order to increase the statistical power for identifying SNPs that are significantly associated with all the diseases in the pool. The fourth approach compares the results of multiple GWAS, and has been previously applied to the WTCCC data set [37]. The authors use the P-values indicating the significance of the association between a SNP and a single disease, and compute the correlations between these P-values in pairs of diseases, as well as the size of the intersection of the 1000 most significant SNPs in pairs of diseases. They identify strong similarities between type 1 diabetes and rheumatoid arthritis, between Crohn’s disease and hypertension, and between bipolar disease and type 2 diabetes.

In this work we introduce a novel approach to identify similarities in the genetic architecture of diseases. We train a classifier that distinguishes between a reference disease and the control set. We then use this classifier to classify all the individuals that have a query disease.
If there is a similarity at the genetic level between the query disease and the reference disease, we expect more individuals with the query disease to be classified as belonging to the disease class than if there is no similarity. We generalize our procedure to multiple disease comparison: given a set of multiple diseases, we use each in turn as the reference disease while treating all others as query diseases.

There are two main differences between our new approach and existing analyses. First, previous approaches (such as [37]) compute a significance score for each SNP, and then use these scores for comparing diseases. In our approach, we first compute a classification for each individual, and then compare diseases using these classifications. Second, we train the classifier using information from all SNPs, and during this learning process select the SNPs that contribute to the classification based on the genotype data only. This genome-wide approach makes it possible to see the classifier as a statistical representation of the differences between the disease set and the control set.

The use of classifiers in the context of GWAS has been limited so far. In particular, attempts at using them for predicting outcome based on genotype have been unsuccessful. For example, a recent prospective study in type 2 diabetes [38] found that using 18 loci known to be associated with type 2 diabetes in a logistic regression classifier together with known phenotypic risk factors does not significantly improve the risk classification, and leads to a reclassification in only 4% of the patients. A particular challenge in the context of outcome prediction is that the prevalence of most diseases is relatively low and that it is therefore necessary to achieve high precision in order for the classifier to be usable. Our goal is not predicting individual outcomes, and we only compare predictions made by a single classifier. We can therefore ignore disease prevalence.

A second challenge in the use of a classification approach for finding disease similarities is that the classifier does not explicitly identify genetic features of the disease, but rather learns to distinguish the disease set from the control set. Differences between the two sets that are due to other factors might therefore lead to incorrect results. In most GWAS, a careful choice of matched controls limits this risk.
However, when using a classifier trained on one GWAS to classify individuals from a different study, there is a risk that the background distribution of SNPs is very different between the populations in which the data sets have been collected, which could lead to errors, particularly when comparing diseases using data sets from different geographic origins. This risk can be limited by using disease data from a single source. In this work, we use genotype data provided by the WTCCC study, in which all individuals were living in Great Britain and individuals with non-Caucasian ancestry were excluded.

In this chapter, we first provide a detailed description of the analysis approach. We then show that we are able to train classifiers that achieve a classification error that is clearly below the baseline error for type 1 diabetes, type 2 diabetes, bipolar disease, hypertension and coronary artery disease. We use these classifiers to identify strong similarities between type 1 diabetes and rheumatoid arthritis, as well as between hypertension and bipolar disease, and weak similarities between type 1 diabetes and both bipolar disease and hypertension. We also show that we are able to train a classifier that distinguishes between the two control sets in the WTCCC data. We use this classifier to identify similarities between some diseases and individual control sets. This finding matches observations made during the quality check phase of the original study. The implications of this finding on our approach are addressed in the results section. Finally, we discuss the implications of the similarities we find, and propose extensions of this approach. The data set used in this work and its pre-processing are described in detail in Chapter 2, and the decision tree classifier and the comparison procedure in Section 3.6.

3.2 Approach

In this section, we define the general classifier-based approach to identify genetic similarities between diseases. The approach can be separated into four steps: data collection, pre-processing, classifier training and disease comparison. Figure 3.1 provides an overview of the training and comparison steps.

The data collection step consists of collecting samples from individuals with several diseases, as well as matched controls, and genotyping them. Alternatively, existing data can be reanalyzed. In both cases, it is important to limit the differences between the disease sets and the control sets that are not related to the disease phenotype. Similarly, differences between the different disease sets should also be limited. In particular, it is recommended to use individuals with the same geographic origin, the same ancestry, and a single genotyping technology for the whole study. In this work we use existing data from the WTCCC, which satisfies these criteria.

[Figure 3.1 (flow diagram). Top: genotypes of individuals with a reference disease and genotypes of healthy control individuals are used to learn a classifier for the reference disease vs. controls. Middle: two histograms, one for the reference disease and one for the controls (x-axis: disease-class probability, 0.0 to 1.0; y-axis: frequency; average shown in red). Bottom: the reference disease classifier is applied to genotype data from other diseases (Disease A, Disease B, Disease C), and their average disease-class probabilities are placed on an axis running from the controls to the reference disease.]

Figure 3.1: Overview of the approach. This figure presents the classification and comparison steps of our analysis pipeline. These steps are repeated using a different reference disease each time. The classifier returns a real value between 0.0 and 1.0, which we call disease-class probability. The histograms represent the distribution of the disease-class probability of the individuals with the reference disease (left) and of the controls (right). In the situation depicted in this figure, there is evidence that query disease C is more similar to the reference disease than the other query diseases.

In the pre-processing step, the data are filtered, and uncertain genotype measurements, as well as individuals and SNPs that do not meet quality requirements, are discarded. It is important to develop pre-processing steps that ensure good data quality. Approaches that analyze each SNP individually can afford to have a more stringent, often manual post-processing step on the relatively few SNPs that show strong association. The SNPs that do not pass this quality inspection can be discarded without affecting the results obtained on other SNPs. In our approach, however, classifier training is done using genome-wide information, and removing even a single SNP used by the classifier could potentially require re-training the entire classifier. It is therefore impractical to perform any kind of post-processing at the SNP level. Chapter 2 describes the data used in this work, as well as the quality control measures we take.

The classifier training and comparison steps are interleaved. We start with a list of diseases and a set of individual genotypes for each disease, as well as at least one set of control genotypes. We pick one disease as reference disease, and refer to the remaining diseases as query diseases. We train a classifier distinguishing the corresponding disease set from the control set. For any individual, this classifier could either return a binary classification (with values 0 and 1 indicating that the classifier believes the individual is part of, respectively, the control class or the disease class) or a continuous value between 0 and 1. This continuous value can be seen as the probability of the individual being part of the disease class, as predicted by the classifier. We refer to this value as disease-class probability. For simplicity, we will only use the disease-class probability values for the rest of this section, but the comparison step can be performed similarly using binary classifications.
During the comparison step, we classify individuals from the query disease sets using the classifier obtained in the training step and, for each query disease, compute the average disease-class probability. The training and comparison steps are then repeated so that each disease is used once as reference disease.

We can compare the average disease-class probabilities of the different query diseases to identify similarities to the reference disease. Diseases that have a higher average disease-class probability are more likely to be similar to the reference disease than diseases with a lower average disease-class probability. Using cross-validation, we can obtain the average disease-class probability of the reference disease set and the control set used for training the classifier, and compare them to the values of the other diseases. One particular caveat that needs to be considered in this analysis is that while the classifier does distinguish the control set from the disease set, there is no guarantee that it will only identify genetic features of the disease set. It is also possible that it will identify and use characteristics of the training set, especially if there are data quality issues. This case can be identified during the comparison step if the average disease-class probability of most query diseases is close to the average disease-class probability of the reference disease, but very different from the average disease-class probability of the control set. It is therefore important to look at the distribution of the average disease-class probabilities of all query diseases before concluding that an individual disease is similar to the reference disease.

It is important to note that the disease-class probability of a given individual does not correspond to the probability of this individual actually having the disease. The disease frequency is significantly higher in the data sets we use for training the classifier than in the real population. In a machine learning problem in which the test data are class-imbalanced, training is commonly done on class-balanced data, and class priors are then used to correct for the imbalance. Such priors would, however, scale all probabilities linearly, and would not affect the relationships we identify, nor their significance. Estimating the probability of an individual having the disease is not the goal of this project and we can therefore ignore class priors.

A large variety of classifiers can be integrated into the analysis pipeline used in our approach.
The methods section provides a more formal description of the classification task. In this work, we use a common classifier, decision trees, to show that this approach allows us to identify similarities. The specific details about the decision tree classifier, and how its outputs are used in the analysis step are described in the methods section.
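The interleaved training and comparison steps described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the implementation used in this work: `train_stump` is a hypothetical one-split stand-in for the decision tree classifier, genotypes are assumed to be encoded as minor-allele counts (0, 1 or 2) with no missing values, and the names `train_stump` and `compare_to_reference` do not appear in the original analysis.

```python
from statistics import mean
from typing import Callable, Dict, List, Sequence

Genotype = Sequence[int]                  # assumed encoding: 0, 1 or 2 minor alleles per SNP
Classifier = Callable[[Genotype], float]  # maps a genotype to a disease-class probability


def train_stump(disease: List[Genotype], controls: List[Genotype]) -> Classifier:
    """Toy stand-in for the decision tree: a single split on one SNP."""
    best_snp, best_delta, best_table = 0, -1.0, {}
    for snp in range(len(disease[0])):
        table = {}
        for g in (0, 1, 2):
            d = sum(1 for ind in disease if ind[snp] == g)
            c = sum(1 for ind in controls if ind[snp] == g)
            # Leaf value: fraction of training individuals in this leaf that
            # belong to the disease set (0.5 if the leaf is empty).
            table[g] = d / (d + c) if d + c else 0.5
        # Keep the SNP whose leaves best separate the two training sets on average.
        delta = (mean(table[i[snp]] for i in disease)
                 - mean(table[i[snp]] for i in controls))
        if delta > best_delta:
            best_snp, best_delta, best_table = snp, delta, table
    return lambda ind: best_table[ind[best_snp]]


def compare_to_reference(reference: str,
                         disease_sets: Dict[str, List[Genotype]],
                         controls: List[Genotype]) -> Dict[str, float]:
    """Train on the reference disease, then average the disease-class
    probability over the individuals of every query disease."""
    clf = train_stump(disease_sets[reference], controls)
    return {name: mean(clf(ind) for ind in inds)
            for name, inds in disease_sets.items() if name != reference}
```

Calling `compare_to_reference` once per disease, with that disease as reference, reproduces the repetition over reference diseases; a query disease whose average lies well above the others is a candidate for similarity with the reference.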

Disease   Baseline   Error    Precision   Recall   ∆p      Leaves
T1D       40.05%     22.93%   71.65%      70.71%   0.383   9
RA        38.43%     33.45%   59.12%      42.09%   0.130   12
BD        38.24%     33.59%   62.60%      30.18%   0.087   11
HT        39.92%     36.77%   57.98%      28.64%   0.080   12
CAD       39.05%     36.62%   55.25%      32.73%   0.075   12
T2D       39.5%      38.0%    54.12%      25.05%   0.052   14
CD        36.63%     36.28%   29.83%      18.43%   0.046   11

Table 3.1: Classifier performance (cross-validation). Baseline corresponds to the baseline error; Error, Precision and Recall to the cross-validation performance of the decision tree classifier; ∆p to the difference between the average disease-class probability of the control set and the average disease-class probability of the disease set; and Leaves to the maximum number of leaves in the pruned classifiers for this disease.

3.3 Results

We evaluate the ability of our analysis approach to identify similarities between diseases using the set of seven diseases provided by the WTCCC. In this section, we first evaluate the performance of individual classifiers that distinguish one disease from the joint control set. We then show that these classifiers can identify similarities between diseases. Finally, we use our classifier to identify differences between the two control sets, and provide evidence indicating that these differences do not affect the disease similarities we identify.

3.3.1 Classifier performance

We first train one classifier for each disease using both the 58C and the UKBS sets as controls. The performance of each classifier is evaluated using cross-validation, and reported in Table 3.1. We compare our classifier to a baseline classifier that classifies all individuals into one class without using the SNP data at all. The best error such a classifier can achieve during cross-validation is the frequency of the smaller class in the training set. We refer to this value as the baseline error.
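The baseline error defined above follows directly from the two class sizes. A minimal sketch, in which the function name `baseline_error` and the example counts are ours and purely illustrative:

```python
def baseline_error(n_disease: int, n_controls: int) -> float:
    """Error of the best classifier that ignores the SNP data.

    Such a classifier assigns every individual to the larger class, so it is
    wrong exactly on the smaller class.
    """
    total = n_disease + n_controls
    if total == 0:
        raise ValueError("need at least one individual")
    return min(n_disease, n_controls) / total


# Hypothetical set sizes, not the actual WTCCC counts.
print(baseline_error(2000, 3000))  # 0.4
```

Because the control set is the larger class in every disease experiment, this baseline corresponds to classifying all individuals as controls.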

The disease for which the classifier performs best is type 1 diabetes, with a classification error of 22.93%, compared to a baseline error of 40.05%. The classification error obtained by the decision tree classifier is also below the baseline error for several other diseases, although by a substantially smaller margin. This is the case for rheumatoid arthritis (with an error of 33.45% versus 38.43%), bipolar disease (33.59% versus 38.24%), hypertension (36.77% versus 39.92%) and coronary artery disease (36.62% versus 39.05%). For two diseases, type 2 diabetes and Crohn's disease, the improvement compared to the baseline error is only minimal, and we choose not to use these classifiers in our analysis. While the classifiers that we keep only provide small improvements in terms of classification error (with the exception of type 1 diabetes), they have a significantly better trade-off between precision (at least 55%) and recall (at least 28%) than the baseline classifier (which would classify all individuals as controls).

We do not use these classifiers in a binary way, but rather use the disease-class probability, which is the conditional probability of an individual being part of the disease class given its genotype, under the model of the reference disease learned by the classifier (see Methods for a precise definition for decision trees). It is therefore interesting to consider the distributions of the disease-class probability, as obtained during cross-validation. Figure 3.2 illustrates that these distributions differ significantly for type 1 diabetes. It can also be seen that there are individuals for which the disease-class probability is close to 50%, meaning that there are leaf nodes in the classifier that represent subsets of the data that cannot be distinguished well. Our approach takes this into account by using disease-class probabilities rather than binary classifications. In order to evaluate the ability of our classifiers to distinguish between the disease set and the control set using the disease-class probability metric, we use the difference ∆p of the average disease-class probability between the two sets. The classifiers that we keep all have values of ∆p of at least 0.075. This illustrates that while there are only small improvements in binary classification performance, the classifiers are able to distinguish between the disease set and the control set in the way we intend to use them.
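The quantities reported in Table 3.1 can be computed from the cross-validated disease-class probabilities and the true labels. The sketch below is illustrative: the function name `evaluate` is ours, and treating a probability of exactly 0.5 as a disease prediction is an assumption about how the 0.5 cut-off is applied.

```python
from statistics import mean
from typing import Dict, Sequence


def evaluate(probs: Sequence[float], is_disease: Sequence[bool],
             cutoff: float = 0.5) -> Dict[str, float]:
    """Binary metrics at the given cut-off, plus the difference of averages."""
    predicted = [p >= cutoff for p in probs]  # assumed tie rule: >= cutoff means disease
    tp = sum(1 for y, yhat in zip(is_disease, predicted) if y and yhat)
    fp = sum(1 for y, yhat in zip(is_disease, predicted) if not y and yhat)
    fn = sum(1 for y, yhat in zip(is_disease, predicted) if y and not yhat)
    # Difference between the average disease-class probability of the disease
    # set and that of the control set; needs at least one individual per class.
    delta_p = (mean(p for p, y in zip(probs, is_disease) if y)
               - mean(p for p, y in zip(probs, is_disease) if not y))
    return {
        "error": (fp + fn) / len(probs),
        "precision": tp / (tp + fp) if tp + fp else 0.0,
        "recall": tp / (tp + fn) if tp + fn else 0.0,
        "delta_p": delta_p,
    }
```

Unlike the three binary metrics, `delta_p` does not depend on the cut-off, which is why it is the more relevant measure for the comparison step.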

3.3.2 Disease similarities

For each of the five classifiers with sufficiently good performance, we compute the average disease-class probability of each of the six query diseases. In summary, we identify strong symmetrical similarities between type 1 diabetes and rheumatoid arthritis, as well as between bipolar disease and hypertension. Furthermore, we find that type 1 diabetes is closer to both bipolar disease and hypertension than other diseases, even though we did not find the symmetrical relation using the type 1 diabetes classifier. This section provides a detailed presentation of these results.

For type 1 diabetes, the average disease-class probabilities for the control set and the disease set, as computed using cross-validation, are respectively 0.259 and 0.642. Figure 3.2 shows the distribution of the average disease-class probabilities for the query diseases. Rheumatoid arthritis, another autoimmune disease, is clearly the closest to type 1 diabetes (average disease-class probability of 0.337). This result is significant, with a P-value smaller than 10^-5 (see the Methods section for details on how P-values are obtained). All other diseases have an average disease-class probability that is close to that of the control set, which means that there is no evidence of similarity with type 1 diabetes.

For rheumatoid arthritis, the average disease-class probabilities are 0.303 for the control set and 0.433 for the disease set. The distribution of the average disease-class probabilities for the other diseases is shown in Figure 3.3a. We can observe that type 1 diabetes (average disease-class probability of 0.397) is closest to rheumatoid arthritis (P-value < 10^-5), meaning that we find a symmetrical similarity between the two diseases. All other diseases have an average disease-class probability close to that of the control set.

For bipolar disease, the average disease-class probabilities are 0.297 for the control set and 0.384 for the disease set. The distribution of the average disease-class probabilities for the query diseases is shown in Figure 3.3b. We can observe that there is a wider spread in the average disease-class probabilities, and that there is no cluster of diseases close to the control set. We can also observe that hypertension (average disease-class probability of 0.359, P-value < 10^-5) is closest to bipolar disease, followed by type 1 diabetes (average disease-class probability of 0.354, P-value of 0.001).

For hypertension, the average disease-class probabilities are 0.315 for the control set and 0.395 for the disease set. The distribution of the average disease-class probabilities for the other diseases is shown in Figure 3.3c. We can observe that bipolar disease (average disease-class probability of 0.381, P-value < 10^-5) is clearly closest to hypertension. Type 1 diabetes (average disease-class probability of 0.368, P-value < 10^-5) is also closer to hypertension than the remaining diseases.

For coronary artery disease, the average differences between the query diseases are smaller than for all the other classifiers (Figure 3.3d). Furthermore, the classifier for coronary artery disease is the one with the worst performance amongst the ones we use in the comparison phase. Therefore we believe that the results are not strong enough to report putative similarities identified using this classifier, even though some differences between diseases have significant P-values.

[Figure 3.2 (plot). Top histogram: joint control set; bottom histogram: type 1 diabetes set (x-axis: disease-class probability, 0.0 to 1.0; y-axis: frequency). Central plot: average disease-class probabilities of the other diseases, shown relative to the interval from 0.259 (control set) to 0.642 (type 1 diabetes set): CD 0.239, BD 0.246, CAD 0.248, T2D 0.254, HT 0.255, RA 0.337.]

Figure 3.2: Distribution of the disease-class probabilities for the type 1 diabetes classifier. The two histograms show the distribution of the disease-class probability of the individuals respectively in the joint control set (top) and in the type 1 diabetes set (bottom), as computed during cross-validation. The red lines represent the average disease-class probabilities, and the black line indicates the 0.5 probability cut-off used for binary classification. The plot in between the histograms shows the average disease-class probabilities of the six other diseases on the interval between the average disease-class probabilities of the control set and of the disease set.

[Figure 3.3 (plots). Four panels sharing the same x-axis (disease-class probability, 0.3 to 0.44):
a) Rheumatoid arthritis: Controls 0.303, CD 0.3, BD 0.305, CAD 0.307, HT 0.307, T2D 0.312, T1D 0.397, RA 0.433
b) Bipolar disease: Controls 0.297, CD 0.332, CAD 0.338, RA 0.343, T2D 0.344, T1D 0.354, HT 0.359, BD 0.384
c) Hypertension: Controls 0.315, CAD 0.338, CD 0.343, RA 0.348, T2D 0.351, T1D 0.368, BD 0.381, HT 0.395
d) Coronary artery disease: Controls 0.303, HT 0.319, CD 0.321, RA 0.331, BD 0.337, T1D 0.341, T2D 0.342, CAD 0.378]

Figure 3.3: Disease-class probabilities comparisons. The plots represent the interval between the average disease-class probabilities of the control set and of the disease set for respectively rheumatoid arthritis, bipolar disease, hypertension and coronary artery disease. The average disease-class probabilities for all the query diseases are shown in blue on every plot. Note that while all plots in this figure use the same scale, different scales are used for the central plots of Figures 3.2 and 3.4.
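Because each classifier has its own control and reference averages, it can help to express a query disease's average as a fraction of the distance from the control set to the reference disease, which is what the central plots of Figures 3.2 and 3.3 show graphically. The normalization below is our own illustrative summary, not a statistic defined in the Methods section, and the function name `relative_position` is hypothetical.

```python
def relative_position(p_query: float, p_control: float, p_reference: float) -> float:
    """Position of a query disease on the control-to-reference interval.

    0.0 means the query average equals the control average, 1.0 means it
    equals the reference disease average; values outside [0, 1] are possible.
    """
    span = p_reference - p_control
    if span <= 0:
        raise ValueError("reference average must exceed control average")
    return (p_query - p_control) / span


# Type 1 diabetes classifier, values from Figure 3.2.
print(round(relative_position(0.337, 0.259, 0.642), 3))  # RA: 0.204
print(round(relative_position(0.255, 0.259, 0.642), 3))  # HT: -0.01
```

On this scale rheumatoid arthritis lies about a fifth of the way towards type 1 diabetes, while hypertension is indistinguishable from the controls. The value says nothing about significance, which still requires the P-values described in the Methods section.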

3.3.3 Differences between control sets

The original WTCCC study found several SNPs that are significantly associated with one of the two control sets. These SNPs are filtered out during pre-processing, both in the WTCCC study and in this work. However, the mere existence of differences between the two control sets raises the question of whether a classifier could distinguish the two sets, and if so, what the implications of this finding would be for the validity of results obtained with these control sets. We perform several experiments using the two control sets separately, and report the results in Table 3.2.

First, we train a control-control classifier that distinguishes the two control sets from each other. This classifier achieves an error of 41.15% compared to a baseline error of 49.62%, and a ∆p of 0.093. This shows that we are able to distinguish to some extent between the two control sets. Figure 3.4 shows the distribution of the 58C class probability (which corresponds to the value called disease-class probability when the classifier distinguishes between one disease and the controls). In order to verify that this result is due to differences between the two specific control sets, and not to the ability of our classifier to distinguish between any two sets, we randomly split all control individuals into two sets, R1 and R2. We train a classifier to distinguish between these two sets. We find that this classifier only minimally improves the classification error (error of 49.45%, baseline error of 50.03%, ∆p of -0.003).

Experiment    Baseline   Error    Precision   Recall   ∆p       Leaves
UKBS / 58C    49.62%     41.15%   58.33%      64.05%   0.093    11
R1 / R2       50.03%     49.45%   50.59%      46.42%   -0.003   11
UKBS / T1D    42.62%     23.15%   79.53%      80.34%   0.402    8
58C / T1D     42.99%     24.46%   76.60%      82.22%   0.370    8
UKBS / RA     44.29%     36.42%   66.21%      70.72%   0.144    10
58C / RA      44.66%     38.11%   64.89%      67.83%   0.135    9

Table 3.2: Separate training set classifier performance. Baseline corresponds to the baseline error; Error, Precision and Recall to the cross-validation performance of the decision tree classifier; ∆p to the difference between the average disease-class probability of the control set and the average disease-class probability of the disease set; and Leaves to the maximum number of leaves in the pruned classifiers for this experiment. R1 and R2 represent two random splits of the joint control set.

We apply the comparison step of our pipeline using the control-control classifier in order to identify possible similarities between the disease sets and one of the control sets. Figure 3.4 shows the distribution of the average 58C class probabilities for each disease. The average 58C class probabilities obtained during cross-validation are 0.477 for the UKBS set and 0.561 for the 58C set. Both hypertension (average 58C class probability of 0.521, P-value < 10^-5) and bipolar disease (average 58C class probability of 0.514, P-value of 0.0002) are closer to the 58C control set, whereas both rheumatoid arthritis (average 58C class probability of 0.487, P-value < 10^-5) and coronary artery disease (average 58C class probability of 0.489, P-value of 0.0003) are closer to the UKBS control set.

[Figure 3.4 (plot). Top histogram: UKBS controls; bottom histogram: 58C controls (x-axis: 58C class probability, 0.0 to 1.0; y-axis: frequency). Central plot: average 58C class probabilities of the diseases on an axis running from 0.468 to 0.561: RA 0.487, CAD 0.489, T1D 0.498, T2D 0.499, CD 0.503, BD 0.514, HT 0.521.]

Figure 3.4: Distribution of the class probabilities for the control-control classifier. This classifier distinguishes the UKBS control set from the 58C control set. The two histograms show the distribution of the 58C class probability of the individuals respectively in the UKBS control set (top) and in the 58C control set (bottom), as computed during cross-validation. The red lines represent the average class probabilities, and the black line indicates the 0.5 probability cut-off used for binary classification. The plot in between the histograms shows the average 58C class probabilities of all seven diseases on the interval between the average class probabilities of the two control sets.

Given the differences between the control sets, and the unexpected similarities between control sets and diseases, we are interested in verifying that the performance of the disease classifiers used in the analysis is not an artifact caused by these differences. We therefore train two new classifiers for each disease, one using only UKBS as control set, and one using only 58C as control set. The performance of these classifiers for type 1 diabetes and rheumatoid arthritis is shown in Table 3.2, and is similar to the performance of the classifiers that use both control sets together. For the remaining diseases (including hypertension and bipolar disease), the classifiers using only one of the control sets do not achieve a classification error below the baseline error, most likely due to the smaller training set (i.e. overfitting). For each of the classifiers for type 1 diabetes and rheumatoid arthritis, we compute the average disease-class probability for the other six diseases as well as the unused control set. The similarities between the two diseases are significant in all four classifiers. Furthermore, the average disease-class probability of the unused control set is similar to the average disease-class probability of the other five diseases, and not significantly closer to type 1 diabetes or rheumatoid arthritis. Therefore we can conclude that the results obtained using the type 1 diabetes and rheumatoid arthritis classifiers are not due to differences between the control sets. Furthermore, the results using a single control set provide further evidence indicating that the classifiers do identify relevant features of respectively type 1 diabetes and rheumatoid arthritis, rather than relevant features of the control set.
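The R1 / R2 negative control used in this section only requires a random partition of the joint control set into two halves. A minimal sketch, in which the function name `random_split` and the fixed seed are illustrative choices rather than details taken from the original experiment:

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def random_split(individuals: Sequence[T], seed: int = 0) -> Tuple[List[T], List[T]]:
    """Shuffle a copy of the individuals and cut it into two halves (R1, R2)."""
    rng = random.Random(seed)   # local generator, so results are reproducible
    shuffled = list(individuals)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]
```

A classifier trained to separate R1 from R2 should stay close to the baseline error, as in the R1 / R2 row of Table 3.2; a clear improvement would indicate that the classifier can separate arbitrary sets and that the UKBS / 58C result is not specific to those two sets.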

3.4 Discussion

In this work, we introduce a novel approach for identifying genetic similarities between diseases using classifiers. We identify genetic similarities between several diseases. In this section, we first discuss the implications of these findings. We then consider challenges in the application of classifiers to GWAS data. Finally, we propose possible extensions of this approach.

We identify a strong similarity between type 1 diabetes and rheumatoid arthritis. Genetic factors that are common to these two autoimmune diseases were identified well before the advent of GWAS, and linked to the HLA genes [39], [40]. The original WTCCC study [10] identifies several genes that appear to be associated with both diseases. We look at the classifiers corresponding to these two diseases. The SNP with the highest information gain in type 1 diabetes is rs9273363, which is located on chromosome 6, near the MHC class II gene HLA-DQB1, and is also the SNP that is most strongly associated with type 1 diabetes in the initial analysis of the WTCCC data, with a P-value of 4.29 · 10^-298 [41]. This is the strongest association reported for any disease in the WTCCC study, which explains to a large extent why the type 1 diabetes classifier so clearly outperforms the classifiers for the other diseases. This SNP is also significantly associated with rheumatoid arthritis (P-value of 6.74 · 10^-11). The SNP with the highest information gain in rheumatoid arthritis is rs9275418, which is also part of the MHC region, and is strongly associated with both rheumatoid arthritis (P-value of 1.00 · 10^-48) and type 1 diabetes (P-value of 7.36 · 10^-126). This shows that our approach is able to recover a known result, and uses SNPs that have been found to be significantly associated with both diseases in an independent analysis of the same data.

The similarity we identify between hypertension and bipolar disease is interesting, since there does not appear to be previous evidence of a link between the two diseases at the genetic level. However, a recent study in the Danish population identified an increased risk of hypertension in patients with bipolar disease compared to the general population, as well as compared to patients with schizophrenia [42]. The WTCCC study only identified SNPs with moderate association to hypertension (lowest P-value of 7.85 · 10^-6) and a single SNP with strong association with bipolar disease (P-value of 6.29 · 10^-8). The decision trees for both diseases use a large number of SNPs that have a very weak association with the respective disease. Both classifiers have a classification error that is clearly below the baseline error, and provide evidence of similarity between the two diseases. This indicates that our classifier-based approach is able to use the weak signals of a large number of SNPs to identify evidence for similarities that would be missed by comparing only SNPs that show moderate or strong association with the diseases. Further analyses are necessary to identify the nature and implications of the similarity we find between hypertension and bipolar disease, as well as the weaker similarity we identified between these two diseases and type 1 diabetes.
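The discussion above refers to the information gain of individual SNPs in the decision trees. For readers unfamiliar with the criterion, the following sketch computes the standard entropy-based information gain of one SNP. It assumes a multiway split with one branch per observed genotype, which may differ from the split type actually used by the classifier described in the Methods section; the function names are ours.

```python
from collections import Counter
from math import log2
from typing import Dict, Hashable, List, Sequence


def entropy(labels: Sequence[Hashable]) -> float:
    """Shannon entropy (in bits) of the class labels."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def information_gain(genotypes: Sequence[int], labels: Sequence[Hashable]) -> float:
    """Reduction in label entropy obtained by splitting on one SNP."""
    groups: Dict[int, List[Hashable]] = {}
    for g, y in zip(genotypes, labels):
        groups.setdefault(g, []).append(y)   # one branch per observed genotype
    n = len(labels)
    remainder = sum(len(sub) / n * entropy(sub) for sub in groups.values())
    return entropy(labels) - remainder
```

A SNP whose genotype perfectly separates a balanced disease and control set has a gain of 1 bit, while a SNP with identical genotype frequencies in both sets has a gain of 0. A SNP with a very strong association, such as the one reported for type 1 diabetes, therefore tends to be selected at the root of the tree.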

We also show that we can train a classifier that can distinguish the two control sets, and we use it to identify diseases that are more similar to one of the control sets than to the other. This is not an unexpected finding, since SNPs that were strongly associated with a control set were identified and discarded in the WTCCC study. These SNPs were also removed in the pre-processing step of our study, and the results we obtain when trying to distinguish the two control sets therefore show that the decision tree classifier is able to achieve a classification error below the baseline error even though the SNPs with the strongest association could not be used by the classifier. The similarities between some diseases and one of the control sets can most likely be explained by some subtle data quality issue. During quality control, the authors of the WTCCC study found several hundred SNPs in which some data sets exhibited a particular probe intensity clustering (see the Supplementary Material of the original WTCCC study [10] for details). This particular pattern was always observed in 58C, BD, CD, HT, T1D and T2D, but not in UKBS, RA and CAD. This matches the result obtained using our classifier-based approach, in which RA and CAD were predicted to be most similar to UKBS, and could therefore be a possible explanation of the similarities we find.

While we do find several interesting similarities between diseases, we also observe that training a classifier that distinguishes between individuals with a disease and controls using SNP data poses numerous challenges. The first is that whether someone will develop a disease is strongly influenced by environmental factors. The genetic associations that can be identified using GWAS are only predispositions, and it is therefore likely that some fraction of the control set will have the predispositions, but will not develop the disease. Furthermore, depending on the level of screening, the disease might be undiagnosed in some control individuals, and individuals that are part of a disease set might have other diseases as well. This is especially true for high-prevalence diseases like hypertension.

Obtaining good classifier performance by itself is not, however, the main goal of our approach. We show that we can find similarities even when the classifier performance only shows small improvements compared to the baseline error. In this work, we focus on the comparison approach, not on developing a classifier specially suited for the particular task of GWAS classification. We use decision trees because they are a simple, commonly used classification algorithm.

This work shows that classifiers can be used to identify similarities between diseases. This novel approach can be expanded in several directions. First, classification performance can potentially be improved by using a different generic classifier, or by developing classifiers that take into account the specific characteristics of SNP data. Second, further analysis methods need to be developed in order to analyze the trained classifiers, and identify precisely the SNPs that lead to the similarities this approach detects. Such a methodology would be useful, for example, to further analyze the putative similarity between hypertension and bipolar disease. Third, building on the fact that our approach considers the whole genotype of an individual, it could be possible to identify subtypes of diseases, and cluster individuals according to their subtype. Finally, modifying the approach to allow the integration of studies performed in populations of different origins or using different genotyping platforms would allow the comparison of a larger number of diseases.

Our approach identifies similarities between the genetic architecture of diseases. This is, however, only one of the many axes along which disease similarities could be described. In particular, both genetic and environmental factors interact in diseases, and the genetic architecture for two diseases could be similar, but the environmental triggers could be different, leading to low co-occurrence. There is therefore a need for methods that integrate similarities of different kinds that were identified using different measurement and analysis modalities. An example of such an approach is the computation of disease profiles that integrate both environmental etiological factors and genetic factors [43].
In a study published after this work, and which is not part of this thesis, we present a separate method that integrates allele-specific effects into the study of similarities and differences in the genetic architecture of diseases [44]. We find that autoimmune diseases can be separated into two classes. Furthermore, we identify SNPs for which one allele increases the risk of having one autoimmune disease, but decreases the risk of having another autoimmune disease.

3.5 Conclusion

Genome-wide association studies have been used to identify candidate loci likely to be linked to a wide variety of diseases. In this work, we introduce a novel approach that allows identifying similarities between diseases using GWAS data. Our approach is based on training a classifier that distinguishes between a reference disease and a control set, and then using this classifier for comparing several query diseases to the reference disease. This approach is based on the classification of individuals using their full genotype, and is thus different from previous work in which the independent statistical significance of each SNP is used for comparing diseases. We apply this approach to the genotype data of seven common diseases provided by the Wellcome Trust Case-Control Consortium, and show that we are able to identify similarities between diseases. We replicate the known finding that there is a common genetic basis for type 1 diabetes and rheumatoid arthritis, find strong evidence for genetic similarities between bipolar disease and hypertension, as well as evidence for genetic similarities between type 1 diabetes and both bipolar disease and hypertension. We also find similarities between one of the control sets used in the WTCCC (UKBS) and two disease sets, rheumatoid arthritis and coronary artery disease. This similarity may be a consequence of the subtle differences in genotyping quality that were observed during the initial quality control performed by the WTCCC. Our results demonstrate that it is possible to use a classifier-based approach to identify genetic similarities between diseases, and more generally between multiple phenotypes. We expect that this approach can be improved by using classifiers that are more specifically tailored for the analysis of GWAS data, and by the integration of a larger number of disease phenotypes.
The ability to compare similarities between diseases at the whole-genome level will likely identify many more currently unknown similarities. Genetic similarities between diseases provide new hypotheses to pursue in the investigation of the underlying biology of the diseases, and have the potential to lead to improvements in how these diseases are treated in the clinical setting.

3.6 Methods

In this section, we first formally define the classification task that is central to our approach, then describe the specific classifier we use in this work and how we evaluate its performance, and finally describe how we use the classification results to infer relationships between diseases.

3.6.1 Classification task

The data consist of a list of individuals i, a list of SNPs s ∈ S, and the measurement of the genotype g(s, i) of individual i at SNP s. We use $G_i = \{g(1, i), ..., g(|S|, i)\}$ to denote the genotype of individual i at all the SNPs in the study. The genotype measurement is a discrete variable which can take four values: homozygote for the major allele, homozygote for the minor allele, heterozygote and unknown: g(s, i) ∈ {maj, min, het, unk}. Each individual belongs to one of several disease sets, or to the control set. For the WTCCC data used in this work, we have seven disease sets: T1D, T2D, CAD, CD, BD, RA, HT, and we use the union of the 58C and UKBS sets as the control set. For each disease d, we train a classifier that distinguishes between that disease set and the controls. The individuals that are not part of these sets are ignored during the training of this classifier. For each individual i used during training, a binary class variable $c_i$ indicates whether the individual belongs to the disease set ($c_i$ == disease) or to the control set ($c_i$ == control). The supervised classification task consists of predicting the class $c_i$ of an individual i given its genotype $G_i$. In this work, we use a decision tree classifier, but any algorithm able to solve this classification task can be easily integrated into our analysis pipeline.

3.6.2 Decision trees

In this section, we describe the decision tree classifier [45]. We use cross-validation in order to train the classifier, prune the trained decision tree, and evaluate its performance on distinct sets of individuals.

We train a decision tree T by recursively splitting the individuals in each node using maximum information gain for feature selection. We use binary categorical splits, meaning that we find the best rule of the form g(s, i) == γ, where γ ∈ {maj, min, het}. Binary splits make it possible to handle cases in which only one of the three possible genotypes is associated with the disease without unnecessarily splitting individuals that have the two other genotypes. Unknown values are ignored when computing information gain. This is necessary since there is a correlation between the frequency of unknown values and the quality of the genotyping, which in turn is variable between the different data sets. Counting unknown values during training could therefore lead to classifiers separating the two sets of individuals based on data quality differences, rather than based on genetic differences. However, if a large number of measurements are unknown for a given SNP, the information gain for that SNP will be biased. This is particularly true if the fraction of unknowns is very different between the cases and the controls. In order to avoid this situation, we discard all SNPs that have more than 5% of unknown genotypes amongst the training individuals in the node we are splitting. In each leaf node L, we compute the fraction $f_L$ of training individuals in that node that are part of the disease class: $f_L = \frac{\sum_{i \in L} (c_i == \text{disease})}{|L|}$, where each comparison counts as 1 when true and 0 otherwise. In order to choose a pruning algorithm, we compare the cross-validation performance obtained using Cost-Complexity Pruning [45], Reduced Error Pruning [46], as well as a simple approach consisting of limiting the tree depth. We find that Reduced Error Pruning outperforms Cost-Complexity Pruning, and performs similarly to limiting the tree depth, but results in smaller decision trees.
We therefore use Reduced Error Pruning, which consists of recursively eliminating subtrees that do not improve the classification error on the pruning set (which only contains individuals that were not used during training). The classification of an individual i using a decision tree T is done by traversing the tree from the root towards a leaf node L(i) according to the genotype of the individual which is classified. If $f_{L(i)}$ is greater than 0.5, then the individual is classified as disease, else the individual is classified as control. We can consider the decision tree T as a high level statistical model of the difference between the disease and the control sets.

Under this model, the fraction $f_{L(i)}$ represents the conditional probability of individual i to be part of the disease class given its genotype: $P_T(c_i == \text{disease} \mid G_i) = f_{L(i)}$. This value is the disease-class probability of individual i. In order to compute the fractions $f_L$ over sufficiently large numbers of individuals, we further prune our tree to only have leaf nodes containing at least 100 training individuals. The benefit of using this probability rather than the binary classification is that it distinguishes leaf nodes in which there are mainly training individuals from one class from those in which both classes are almost equally represented. In order to assess the performance of our classifier, we perform 5-fold cross-validation. We start by separating the data into five random sets containing 20% of the individuals each. A decision tree T is trained using four of these sets, while one set is reserved for pruning and testing. The unused set is split randomly into two equal sets. The first of these sets is used to obtain the pruned tree T' from tree T, and the individuals in the second set are used to evaluate the performance of tree T'. The last step is then repeated using the second set for pruning, and the first for testing. Finally, we repeat the training and evaluation four more times, each time leaving out a different set for pruning and testing. This ensures that for every individual in our data set, there is one pruned decision tree for which the individual was used neither for training nor for pruning. We can therefore evaluate the performance of the classifier on unseen data. We can also compute the average disease-class probability p(C) of the control individuals, and the average disease-class probability p(d) of the individuals with disease d. The difference ∆p between those two probabilities indicates how well the classifier is able to distinguish controls from diseases.
We use the cross-validation results to compare the performance of the classifier against a baseline classifier which simply assigns the most frequent label amongst the training set to all individuals. Classifiers that do not outperform this baseline classifier, or for which the difference ∆p is small, are not used to identify similarities between diseases. Given the cross-validation scheme used, we end up training not one, but several possibly distinct decision trees. Rather than arbitrarily choosing one, we use the set $T_d$ of all decision trees trained during cross-validation for a given disease d. In order to classify a new individual i, we first classify i using each classifier independently, and then return the average classification. Similarly, we average the results of individual classifiers to obtain the average disease-class probability: $P_{T_d}(c_i == \text{disease} \mid G_i) = \frac{\sum_{T \in T_d} P_T(c_i == \text{disease} \mid G_i)}{|T_d|}$.
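The split rule described in this section can be illustrated with a minimal Python sketch. It is not the implementation used in this work: the function names, the dictionary-based data layout, and the toy data in the usage note are assumptions made for illustration. The sketch selects the binary rule g(s, i) == γ with maximum information gain, ignores unknown genotypes when computing the gain, and skips SNPs with more than 5% unknown genotypes in the node.

```python
import math
from collections import Counter

GENOTYPES = ("maj", "min", "het")   # candidate values for gamma
MAX_UNKNOWN_FRACTION = 0.05         # SNPs above this threshold are discarded


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    if n == 0:
        return 0.0
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(values, labels, gamma):
    """Gain of the binary split g == gamma, computed on known genotypes only."""
    known = [(v, c) for v, c in zip(values, labels) if v != "unk"]
    if not known:
        return 0.0
    left = [c for v, c in known if v == gamma]
    right = [c for v, c in known if v != gamma]
    n = len(known)
    parent = entropy([c for _, c in known])
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return parent - children


def best_split(node_genotypes, labels):
    """Return (snp, gamma, gain) for the best rule in a node.

    node_genotypes maps a SNP name to the list of genotypes of the
    individuals in the node, in the same order as labels.
    """
    best = (None, None, 0.0)
    for snp, values in node_genotypes.items():
        if not values or values.count("unk") / len(values) > MAX_UNKNOWN_FRACTION:
            continue  # too many unknown genotypes in this node
        for gamma in GENOTYPES:
            gain = information_gain(values, labels, gamma)
            if gain > best[2]:
                best = (snp, gamma, gain)
    return best
```

As a usage example on hypothetical data, a node with four disease and four control individuals in which all disease individuals are homozygous for the minor allele at one SNP yields the rule `("snp_signal", "min", 1.0)`, while a SNP with one unknown genotype out of eight (12.5%) is skipped regardless of its gain.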

3.6.3 Identifying similarities

Once a classifier has been trained to distinguish the set of individuals with reference disease d from the control set, we can use it to identify diseases that are similar to disease d. Using the classifier, we can compute the disease-class probability of an individual with a query disease d'. In order to be able to compare diseases, we are interested in computing the average disease-class probability of all individuals in d': $p(d') = \frac{\sum_{i \in d'} P_{T_d}(c_i == \text{disease} \mid G_i)}{|d'|}$. We expect this average probability to be in, or close to, the interval between p(C) and p(d), which were the averages computed on respectively the control set and the disease set d during cross-validation. If p(d') is close to p(C), then d' is not very different from the control set, whereas a value p(d') that is close to p(d) indicates similarity between the two diseases. Using this method, we can compare all query diseases to the reference disease d, and identify if there are diseases that are more similar to d than others. If we find that a query disease d' is closer to reference disease d than the other query diseases, then we need to assess the significance of this finding. In order to do so, we randomly sample a set r of individuals from all the disease sets except d, such that r is of the same size as d', and compute p(r). We repeat this procedure 10,000 times. The fraction of random samples r for which p(r) ≥ p(d') indicates how often a random set of individuals would obtain a probability of being part of the disease class at least as high as the set d', and is therefore a P-value indicating how significant the similarity between d' and d is.

Chapter 4

Analysis of the Pseudoautosomal Regions

4.1 Introduction

In humans, the sex chromosomes X and Y represent the major genetic difference between males and females. In general, females have two copies of the X chromosome, whereas males have one copy of the X chromosome and one copy of the Y chromosome. The Sex-determining region Y gene SRY [47] is located on the Y chromosome and has been shown to initiate testis development, which leads to a male phenotype. The two sex chromosomes evolved from autosomes and differentiation started early in the mammalian lineage [48]. In humans, sequence homology between the X and Y chromosomes is mostly low, and the X chromosome is substantially longer (150 mega base pairs) than the Y chromosome (50 mega base pairs). There are, however, two regions of high sequence homology between the sex chromosomes called Pseudoautosomal Regions (PAR). The Pseudoautosomal Region 1 (PAR1) is located at the distal end of the p arm of the X and Y chromosomes, and measures 2.6 mega base pairs [49]. The Pseudoautosomal Region 2 (PAR2) is located at the distal end of the q arm of the X and Y chromosomes, and is much shorter than PAR1 (320 kilo base pairs) [50, 51]. As their name indicates, pseudoautosomal regions behave similarly to autosomes. In particular, given the sequence homology between the two sex chromosomes in the PARs, recombination can happen in these regions [52, 53]. An obligatory recombination event does happen in PAR1 during male meiosis [54]. A specific mechanism underlying this obligatory recombination has been identified in mice [55]. The Pseudoautosomal Boundary (PAB) separates the PARs from the chromosome-specific regions of the sex chromosomes, which cannot recombine [56]. The PAB results from the presence of an Alu repeat sequence on the Y chromosome [57], an event that happened after the divergence between the Old World monkey and great ape lineages [58]. This mechanism is critical for the XY sex-determination system to work, as recombination between the X and the Y chromosomes outside of the PAR would result in X chromosomes carrying SRY. It is therefore interesting to note that SRY is located only 5.3 kilo bases proximal to the PAB separating PAR1 from the male specific region of the Y chromosome. Several important genes are located in the PARs [59, 60]. These genes are not affected by X-inactivation [61], and therefore behave like autosomal genes. Furthermore, the XG gene, which encodes a blood group antigen [62], includes exons located on both sides of the PAB [63]. Three exons on the 5' end of the gene are located in PAR1, and are therefore present on both sex chromosomes. The remaining exons of the functional copy of the gene are located on the X chromosome only [64, 65]. A truncated version of the gene, XGPY, appears to be transcribed from the Y chromosome, but is not known to be functional [66]. In this work, we investigate differences between males and females at common Single Nucleotide Polymorphisms (SNPs) located in the PARs. In order to do so, we re-purpose Genome Wide Association Studies (GWAS) data originally collected for the purpose of identifying disease associations.
We separate the individuals in the control sets according to their sex, and identify significant differences in genotype frequencies between males and females for SNPs located in PAR1. We show that for the most significant association, SNP rs312258, the genotype frequency differences can be explained by differences in minor allele frequency between the X and Y chromosome. We reproduce this result in the HapMap 3 [67] Utah residents with ancestry from northern and western Europe (CEU) population. We hypothesize that the difference results from different selective pressures acting on the X and Y chromosomes.

[Figure 4.1 (plot not reproduced). Horizontal axis: position on chrX, from 0 to 3,000,000. Vertical axis: Log(P-value). Region labels: PAR1, Chr. X, Chr. Y. Labeled SNPs: rs311150, rs311149, rs2535443, rs311161, rs312258.]

Figure 4.1: Manhattan plot of differences between males and females in PAR1. The red line indicates the pseudoautosomal boundary (PAB).

4.2 Results

4.2.1 Significant differences between males and females in the pseudoautosomal region 1

We use all control individuals, and split them according to the sex information reported by WTCCC. We then perform an analysis that is similar to a traditional GWAS, except that we compare males to females rather than cases to controls. We identify several genome-wide significant differences between males and females in PAR1. Figure 4.1 shows a Manhattan plot of all SNPs in PAR1. SNPs that show significant differences in genotype frequencies between males and females appear close to the PAB. Table 4.1 shows the genotype counts and P-values for all genome-wide significant SNPs identified in PAR1. The most significant difference is observed for rs312258, a SNP located in the last intron of gene CD99 (P-value of 5.91 · 10^-16), and 41 kilo bases distal to the PAB. The frequency of the minor allele (A) in females is $p_f$ = 0.22, compared to $p_m$ = 0.31 in males. This leads to an allelic odds ratio of 1.7.

SNP         Allele   Female             Male               P-value          A allele frequency
            A   B    AA    AB    BB     AA    AB    BB                      Female   Male
rs312258    A   C    74    487   905    117   641   654    5.91 · 10^-16    0.22     0.31
rs311161    A   C    673   648   157    830   538   74     6.95 · 10^-13    0.67     0.76
rs2535443   A   G    418   730   337    525   703   211    1.31 · 10^-9     0.53     0.61
rs311149    A   T    323   720   403    207   701   508    7.43 · 10^-9     0.47     0.39
rs311150    C   T    420   710   327    508   691   208    3.73 · 10^-8     0.53     0.61

Table 4.1: Significant genotype differences between males and females in WTCCC controls
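The allele frequency columns of Table 4.1 follow directly from the genotype counts. As a minimal illustration (the function name is an assumption introduced here, not part of the original analysis), the frequency of the first listed allele is the number of copies of that allele divided by the total number of alleles:

```python
def a_allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of allele A given the counts of AA, AB and BB genotypes."""
    total_alleles = 2 * (n_aa + n_ab + n_bb)
    return (2 * n_aa + n_ab) / total_alleles


# rs312258 in the WTCCC controls (counts from Table 4.1)
female = a_allele_frequency(74, 487, 905)
male = a_allele_frequency(117, 641, 654)
print(f"female: {female:.2f}, male: {male:.2f}")  # female: 0.22, male: 0.31
```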

We also identify several genome-wide significant SNPs on autosomes, which are however all false positives caused by genotype calling artifacts (see Section 2.3).

4.2.2 Replication

We first attempt to replicate the result we observe for rs312258 in the control populations using disease populations from the WTCCC study. Results are shown in Table 4.2. For each disease, we first assess the association between rs312258 and the disease separately in males and females. We show that there is no genome-wide significant association for any disease in either sex. The disease data sets can therefore be used as replication cohorts for the purpose of showing a difference between males and females in general. We observe significant differences in most disease populations. We combine the P-values obtained in the disease populations and the control population using Fisher's combined probability test [68], and obtain a meta-analysis P-value of 1.87 · 10^-49.

Population                Female             Male               Female vs male   Case vs control
                          AA    AC    CC     AA    AC    CC     P-value          Female   Male
Controls                  74    487   905    117   641   654    5.91 · 10^-16
Bipolar disease           61    386   678    63    284   319    4.78 · 10^-7     0.73     0.43
Coronary artery disease   20    122   242    113   588   643    9.02 · 10^-7     0.86     0.71
Crohn's disease           48    348   621    54    293   304    2.40 · 10^-8     0.84     0.99
Hypertension              55    410   699    67    306   390    2.20 · 10^-5     0.55     0.064
Rheumatoid arthritis      91    475   788    33    190   219    0.0056           0.06     0.50
Type 1 diabetes           39    362   548    109   435   446    1.86 · 10^-11    0.037    0.073
Type 2 diabetes           40    297   455    93    514   486    6.89 · 10^-8     0.12     0.61

Table 4.2: Analysis of rs312258 in WTCCC disease populations
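Fisher's combined probability test can be sketched in a few lines of Python. The statistic is $X = -2 \sum_i \ln p_i$, which follows a chi-square distribution with 2k degrees of freedom for k independent P-values. The sketch below uses the closed form of the chi-square tail for an even number of degrees of freedom, so it needs only the standard library. It assumes that the eight female-versus-male P-values of Table 4.2 (controls plus seven diseases) are the ones being combined; with these rounded inputs the result should be of the same order as the reported meta-analysis P-value.

```python
import math


def fisher_combined_p(p_values):
    """Fisher's combined probability test for independent P-values."""
    k = len(p_values)
    half = -sum(math.log(p) for p in p_values)  # half of the chi-square statistic
    if half == 0.0:
        return 1.0
    # Chi-square tail with 2k degrees of freedom:
    # P = exp(-half) * sum_{j=0}^{k-1} half^j / j!
    # Each term is evaluated in log space to avoid overflow and underflow.
    log_terms = [j * math.log(half) - math.lgamma(j + 1) - half for j in range(k)]
    return sum(math.exp(t) for t in log_terms)


# Female vs male P-values from Table 4.2 (assumed to be the combined inputs)
table_4_2 = [5.91e-16, 4.78e-7, 9.02e-7, 2.40e-8, 2.20e-5, 0.0056, 1.86e-11, 6.89e-8]
print(f"{fisher_combined_p(table_4_2):.2e}")
```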

As both the controls and the disease data sets were collected as part of the same WTCCC study, they share the same genotyping platform, genotype calling algorithm and quality control pipeline. Each of these parts could lead to artifacts that may explain the association. It is therefore important to replicate the association using a completely different data set. We do so in the HapMap 3 CEU population, which like the WTCCC populations is of European descent. We observe a significant difference between males and females for rs312258 (P-value of 0.0074). Given the small number of individuals in HapMap 3, this P-value would not reach genome-wide significance if all SNPs in HapMap 3 had been tested. It is however significant for the purpose of replicating this individual result. Table 4.3 shows the genotype counts for rs312258 in the HapMap 3 CEU population. The minor allele frequencies in this population are comparable to those observed in the WTCCC controls (0.23 for females, 0.32 for males).

          AA    AC    CC
Female    5     15    35
Male      2     31    22

Table 4.3: Genotype counts for rs312258 in HapMap 3

4.2.3 Modified Hardy-Weinberg model

In order to identify possible causes for the observed genotype frequency differences, we compare the observed genotype frequencies in both sexes to the frequencies expected under the Hardy-Weinberg model [69, 70], in which the allele and genotype frequencies remain stable from one generation to the next. Deviations from the genotype frequencies expected under Hardy-Weinberg are a symptom of many disturbances, such as positive selection or non-random mating. For a given SNP, the Hardy-Weinberg model assumes that both alleles of an individual are drawn from the same pool of possible alleles. If the frequency of the minor allele in the population at the previous generation is p (and the major allele frequency therefore (1 − p)), then the probability of an individual in the current generation to be homozygote for the minor allele will be $p^2$. Similarly, the probability of an individual to be homozygote for the major allele will be $(1 - p)^2$, and the probability of an individual to be heterozygote will be 2 · p · (1 − p). These probabilities correspond to the estimated frequency of each genotype in the population at the current generation. It can easily be shown that the allele frequencies remain unchanged. We compare the observed genotype counts for rs312258 in males and females in the WTCCC controls to the expected genotype counts under Hardy-Weinberg, using the observed minor allele frequencies for each sex (Table 4.4). We compute the percentage absolute difference between the observed and the expected genotype counts. We note that this difference is larger in males (5.1%) than in females (1.5%).

                    AA     AC     CC     Difference
Female              74     487    905
  Hardy-Weinberg    68     497    899    1.5%
Male                120    658    675
  Hardy-Weinberg    138    620    693    5.1%
  Modified model    126    645    681    1.7%

Table 4.4: Comparison of observed and estimated genotype counts

The Hardy-Weinberg model as described above assumes that both alleles of the SNP are drawn from the same pool of alleles. This assumption is generally valid in the context of autosomes, as well as for the non-PAR regions of the X chromosome. The Hardy-Weinberg model does not apply to the male specific region of the Y chromosome since a male will only have one copy of the chromosome. In the context of Hardy-Weinberg, the pseudoautosomal regions do not behave exactly like autosomes. It is possible that, for a given SNP, the minor allele frequency is different on the X chromosome than on the Y chromosome. This will not change the Hardy-Weinberg model for females, who will draw two alleles from the X chromosome pool. However, for males one allele will be drawn from the X chromosome pool and one from the Y chromosome pool. A female receives one X chromosome from a male and one from a female, and can transmit either chromosome to a male or a female. We can thus assume that there is a single X chromosome pool. We therefore need to modify the Hardy-Weinberg model for males. We consider $p_X$ as the minor allele frequency of the SNP on the X chromosome, and $p_Y$ as the minor allele frequency of the SNP on the Y chromosome. Since our model uses a single pool for the X chromosome, we can use the observed minor allele frequency in females as an estimate of the minor allele frequency on the X chromosome: $p_X = p_f$. Since we have $p_m = (p_X + p_Y)/2$, we can compute an estimate of the minor allele frequency on the Y chromosome: $p_Y = 2 \cdot p_m - p_X$. The modified genotype frequencies under this model are $p_X \cdot p_Y$ for homozygous for the minor allele, $p_X \cdot (1 - p_Y) + (1 - p_X) \cdot p_Y$ for heterozygous, and $(1 - p_Y) \cdot (1 - p_X)$ for homozygous for the major allele. We apply this method to the male controls of the WTCCC data set, and obtain an estimated minor allele frequency for rs312258 on the Y chromosome of $p_Y$ = 0.40, compared to $p_X$ = 0.21 on the X chromosome. Using the modified Hardy-Weinberg model, we reduce the percentage absolute error to 1.7% (Table 4.4). The observation that a modified Hardy-Weinberg model that allows different minor allele frequencies on the X and Y chromosome fits the observed data better than the original Hardy-Weinberg model is evidence that the minor allele frequency at rs312258 likely differs between the two sex chromosomes.
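The standard and modified Hardy-Weinberg expectations can be compared with a short Python sketch using the observed counts of Table 4.4. The function names are introduced here for illustration, and the definition of the percentage absolute difference (sum of absolute differences between observed and expected counts, divided by the number of individuals) is an assumption, since it is not spelled out in the text. Under that assumption the sketch gives values close to those of Table 4.4, with small deviations that may come from rounding of the expected counts.

```python
def allele_frequency(n_aa, n_ab, n_bb):
    """Frequency of allele A given the counts of AA, AB and BB genotypes."""
    return (2 * n_aa + n_ab) / (2 * (n_aa + n_ab + n_bb))


def hardy_weinberg(p, n):
    """Expected (AA, AB, BB) counts when both alleles come from one pool."""
    return (p * p * n, 2 * p * (1 - p) * n, (1 - p) ** 2 * n)


def modified_hardy_weinberg(p_x, p_y, n):
    """Expected male (AA, AB, BB) counts with separate X and Y allele pools."""
    return (
        p_x * p_y * n,
        (p_x * (1 - p_y) + (1 - p_x) * p_y) * n,
        (1 - p_x) * (1 - p_y) * n,
    )


def percent_abs_difference(observed, expected):
    """Assumed definition: total absolute count difference, in percent of n."""
    return 100 * sum(abs(o - e) for o, e in zip(observed, expected)) / sum(observed)


female = (74, 487, 905)    # observed counts, Table 4.4
male = (120, 658, 675)

p_x = allele_frequency(*female)          # p_X = p_f
p_m = allele_frequency(*male)
p_y = 2 * p_m - p_x                      # p_Y = 2 * p_m - p_X

n_male = sum(male)
standard = hardy_weinberg(p_m, n_male)
modified = modified_hardy_weinberg(p_x, p_y, n_male)
print(f"p_X = {p_x:.2f}, p_Y = {p_y:.2f}")
print(f"standard: {percent_abs_difference(male, standard):.1f}%")
print(f"modified: {percent_abs_difference(male, modified):.1f}%")
```

Note that the estimate $p_Y = 2 \cdot p_m - p_X$ is not constrained to lie between 0 and 1, so a sketch like this one would need an explicit check before being applied to arbitrary SNPs.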

4.2.4 Phasing

We can use the modified Hardy-Weinberg model to estimate the minor allele frequencies on the X and Y chromosomes in the population based on the genotype information only. Genotyping platforms cannot directly indicate which allele is on which physical chromosome if the individual is heterozygous. The task of assigning alleles to individual chromosomes is called phasing. Phasing would be particularly useful in the context of the PARs in males, as it would determine which allele is on the X chromosome, and which allele is on the Y chromosome. Computational approaches have been developed to phase genotype data on autosomes [71], but these approaches have not been designed to handle the particular case of the PARs. In particular, such approaches do not integrate the differences in allele frequency between chromosomes that we identify herein. The assumptions underlying the models used in existing phasing methods are therefore not applicable to the case we are interested in, and we cannot use them in order to phase individuals genotyped in the WTCCC study. The HapMap 3 data set does, however, include data for trios (mother, father, child). This makes it possible to exactly determine which allele a male child has on each sex chromosome in most cases (see Section 4.5.2). We can therefore compute the actual frequencies of the minor allele on the X and Y chromosomes, and compare them to the estimated frequencies obtained on the WTCCC controls. Table 4.5 shows the counts for each allele on each sex chromosome in the HapMap 3 CEU population. The frequencies of the minor alleles are comparable to the estimates obtained on the WTCCC controls: $p_Y$ = 0.34, and $p_X$ = 0.21. Given the small number of individuals in HapMap 3, this difference is not statistically significant on its own (Fisher's exact test P-value of 0.10), but this result provides additional evidence supporting a difference in allele frequencies between males and females at rs312258.
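For reference, Fisher's exact test on a 2 × 2 table of allele counts can be written with the standard library only. The sketch below is a generic two-sided version that sums the hypergeometric probabilities of all tables at most as likely as the observed one. Whether the test reported above was one-sided or two-sided is not stated, so the two-sided choice is an assumption, as are the function and variable names.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the 2x2 table [[a, b], [c, d]]."""
    row1, row2, col1 = a + b, c + d, a + c
    n = row1 + row2

    def prob(x):
        # Hypergeometric probability of observing x in the top-left cell
        return comb(row1, x) * comb(row2, col1 - x) / comb(n, col1)

    lowest = max(0, col1 - row2)
    highest = min(row1, col1)
    p_observed = prob(a)
    # Small relative tolerance so that ties with the observed table are included
    return sum(
        prob(x) for x in range(lowest, highest + 1)
        if prob(x) <= p_observed * (1 + 1e-9)
    )


# Allele counts per chromosome for rs312258 in HapMap 3 (Table 4.5)
y_a, y_c = 15, 28
x_a, x_c = 27, 99
print(f"p_Y = {y_a / (y_a + y_c):.2f}, p_X = {x_a / (x_a + x_c):.2f}")
print(f"P = {fisher_exact_two_sided(y_a, y_c, x_a, x_c):.2f}")
```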

4.2.5 Evolution of differing allele frequencies

      A     C
Y     15    28
X     27    99

Table 4.5: Allele counts per chromosome for rs312258 in HapMap 3

We observe a difference in minor allele frequency at rs312258 between the X and the Y chromosomes in the current population. In this section, we build a model of the evolution of this difference over time under the assumption that this region is not under selective pressure. We denote the frequencies of the minor allele A on the X and Y chromosomes at the current generation by $p_X$ and $p_Y$ respectively, and the same frequencies in the next generation by $p'_X$ and $p'_Y$. During male meiosis I, exactly one recombination event between X and Y must happen in PAR1 [54]. As this unique recombination event happens at a stage of meiosis during which there are two sister chromatids for each sex chromosome, there is a 50% chance that no recombination happens on the sex chromosome transmitted from the father to the child. We first consider the case in which the child is male. If the recombination event is distal to the SNP, or not on the transmitted chromosome, then we have $p'_Y = p_Y$. If the recombination event happens between the SNP and the PAB, then the allele that was on the X chromosome will be on the Y chromosome passed on to the male child, and we have $p'_Y = p_X$. If we consider r as being the probability of a recombination happening between the SNP and the PAB, then we obtain $p'_Y = (1 - r) \cdot p_Y + r \cdot p_X$. Since we have $p_Y > p_X$, the minor allele frequency on the Y chromosome will decrease: $p'_Y < p_Y$. The case in which the child is female is similar. However, only one third of the X chromosomes in the next generation are passed on from a male, and therefore 2/3 of the X chromosomes will not recombine with Y between two consecutive generations. The minor allele frequency on the X chromosome in the next generation is therefore $p'_X = \frac{2}{3} \cdot p_X + \frac{1}{3} \cdot ((1 - r) \cdot p_X + r \cdot p_Y) > p_X$. Therefore the difference between the minor allele frequencies will decrease over time. It is thus surprising to observe such a large difference between males and females at rs312258 in the current population.
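The two recurrences can be iterated numerically to see how quickly the difference shrinks. Subtracting them gives $p'_Y - p'_X = (1 - \frac{4}{3} r) \cdot (p_Y - p_X)$, so in the absence of selection the difference decays geometrically. The Python sketch below checks this; the value r = 0.01 is a hypothetical recombination probability chosen for illustration only, not an estimate from the data, and the function names are likewise assumptions.

```python
def next_generation(p_x, p_y, r):
    """Apply one generation of the neutral model to (p_X, p_Y)."""
    new_p_y = (1 - r) * p_y + r * p_x
    new_p_x = (2 / 3) * p_x + (1 / 3) * ((1 - r) * p_x + r * p_y)
    return new_p_x, new_p_y


def simulate(p_x, p_y, r, generations):
    """Return the list of (p_X, p_Y) pairs, starting with the initial values."""
    history = [(p_x, p_y)]
    for _ in range(generations):
        p_x, p_y = next_generation(p_x, p_y, r)
        history.append((p_x, p_y))
    return history


R_HYPOTHETICAL = 0.01  # illustrative value only, not estimated from data
history = simulate(p_x=0.21, p_y=0.40, r=R_HYPOTHETICAL, generations=100)
final_p_x, final_p_y = history[-1]
print(f"difference after 100 generations: {final_p_y - final_p_x:.3f}")
```

Under these recurrences the weighted mean $(3 \cdot p_X + p_Y)/4$ stays constant from one generation to the next, which reflects the three-to-one ratio of X to Y chromosomes in a population with equal numbers of males and females.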

4.3 Discussion

The discovery of SNPs that show significant differences in genotype frequencies between males and females appears surprising at first. In the context of a traditional case-control GWAS, a SNP showing a difference between the two populations as strong as rs312258 (P-value of 5.91 · 10^-16, odds ratio of 1.7) would be a very significant association. The interpretation would be that the risk allele confers a statistically significant increase in disease risk. Furthermore, this association would likely be seen as causal: the SNP is either itself involved in a biological process of relevance for the disease, or is a proxy (through linkage disequilibrium) of a nearby SNP involved in the disease. In the context of a complex disease, the SNP would likely be one of many factors contributing to the overall disease risk of an individual. We are, however, not comparing cases to controls, but males to females, and most of these interpretations are not applicable. In particular, having an A allele at rs312258 does not contribute to the overall probability of an individual to be a male, even though the A allele is significantly more prevalent in males. The SRY gene on chromosome Y has been shown to be the sole determinant of sex in humans, and therefore a causal role of rs312258 in sex determination can be ruled out. Unlike in a traditional GWAS, the observed genotype is not part of the cause of the studied phenotype, but is its consequence. We show that the reason for the difference in genotype frequencies is likely to be a difference in minor allele frequency between the X and the Y chromosome. The possibility of such a difference has been hypothesized in theoretical models of the sex chromosomes [72, 73], but to our knowledge, this study is the first to report an observed difference in humans.
Both existing models are consistent with our analysis of the evolution of allele frequencies, and show that even small recombination rates are sufficient to keep the X and Y chromosomes similar unless there is active selection. This leads to two potential explanations for the differences we observe: extremely low recombination, and selection. Given the necessity of one recombination event per male meiosis within the comparably small PAR1, the overall recombination rate in this region is high. Estimates in males indicate that it can be as high as 20 centimorgans per mega base pair (cM/Mb), whereas the recombination rate in females (0.30-2.55 cM/Mb) is similar to the recombination rate observed on the whole X chromosome in females (1.21 cM/Mb), as well as on autosomes (1.40-2.80 cM/Mb) [74]. Estimating precise genetic maps of the PARs has been challenging, as discussed in a recent review by Flaquer et al. [74]: genetic maps built from families lack resolution, genetic maps inferred from unrelated individuals cannot distinguish males from females, and sperm typing studies suffer from the high variability between individuals. Early analyses of the recombination in PAR1 have shown a linear decrease in recombination rates when moving from telomeres towards the PAB [75, 76, 77, 78]. It is interesting to note that all the significant SNPs we identify on PAR1 are located close to the PAB. If the recombination rate in the region between those SNPs and the PAB is low, then it is possible that allele frequency differences between males and females can remain in the population for a long period of time. This explanation alone is, however, insufficient as it does not indicate why such a difference would exist in the first place, especially given that the XY sex-determination system, and the PAR1 itself, date back to the early primate lineage. Selective forces are likely to be playing an active role in the evolution of PAR1, and to account for most of the differences we observe. The Y chromosome has recently been shown to be one of the fastest evolving regions of the human genome, with significant differences between humans and their closest relative, the chimpanzee [79]. Furthermore, a comparison between human and macaque Y chromosomes shows that events that suppressed recombination between segments of X and Y led to a rapid gene loss on the Y chromosome, followed by evolutionary conservation [80]. The SRY gene itself has undergone rapid sequence evolution in the mammalian lineage [81].
Various models of sex-specific selection that incorporate different selective forces in males and females have been proposed [82]. The integration of these models in the context of a region in which one chromosome is only present in males can lead to persistent differences in allele frequencies [83]. Only alleles that are beneficial to males will be selected for on the Y chromosome, whereas selection on the X chromosome will be stronger for alleles that are beneficial to females. Whether rs312258 itself confers any selective advantage to either sex remains unclear. While the SNP is located in the last intron of gene CD99, we do not find any overlap between the SNP and experimental functional data generated by the ENCODE consortium (using the methods described in Chapter 5).

An alternative explanation for the difference at rs312258 would be that the SNP is not itself under selection, but in linkage disequilibrium with regions undergoing selection. In particular, given the proximity of the PAB and the observed reduction in recombination close to the PAB, the probability of a recombination event between the SNP and the PAB is low. Therefore, it is possible for the minor allele at rs312258 to rise in frequency on the Y chromosome due to positive selection acting anywhere on the male-specific region of the Y chromosome, which is not subject to any recombination. This effect is similar to the so-called genetic hitchhiking [84] that has been observed on autosomes: mutations that have no beneficial effect themselves, but happen to be physically located close to a beneficial mutation, rapidly rise in frequency in a population, since recombination is unlikely to break the correlation between the two loci. The difference in this case is that a mutation in the PAR that is hitchhiking due to its correlation with a beneficial mutation on Y will increase in frequency on the Y chromosome only. Similarly, it is possible that a beneficial mutation that is located in the non-PAR region of chromosome X but close to the PAB could lead to an increase in frequency of a mutation in the PAR on the X chromosome only. Finally, it is possible that these effects all happen simultaneously: selection on the Y chromosome could lead to an increase in the frequency of one allele of a SNP in the PAR, whereas selection on the X chromosome (which in turn can be different between males and females) could lead to an increase in the frequency of the other allele of the same SNP in the PAR.
We do not identify any significant difference in genotype frequencies between males and females in PAR2. While the two PARs share the basic property of sequence homology between the X and Y chromosomes, there are also major differences between them. PAR2 appeared much more recently in evolution, and is unique to humans [51, 85]. Recent evidence shows that the mechanism underlying recombination in PAR2 may be significantly different from that in PAR1 [86]. This work shows that the sex chromosomes are both a highly complex and a poorly explored part of the human genome. While we focus on differences at the level of individual SNPs, recent work has shown the existence of gene conversion between the X and Y chromosomes [87], and a complex interplay between gene transmission differences and gene silencing [88].

In previous work, we have studied differences in disease associations between males and females using the same WTCCC data set [89]. In particular, we identified a significant sex difference in the association between SNP rs3792106 and Crohn's disease [90]. This SNP is located in gene ATG16L1 on chromosome 2. We showed that the difference is due to a difference in genotype frequencies in the controls, but not in the cases, and that the difference is caused by transmission distortion from the mother to the child. This result is consistent with transmission distortion observed in the same region in the Framingham cohort population [91]. While the difference in genotype frequency for rs3792106 is significant when considering this SNP alone, it does not reach genome-wide significance. In this work we do not identify any genome-wide significant difference in genotype frequencies between males and females on autosomes, and to our knowledge no difference of the magnitude we observe in PAR1 has been identified on autosomes. The biology underlying potential differences on autosomes remains unclear. The model we discuss herein for PAR1 does not apply to autosomes, since each chromosome can be passed on from a male to a female and vice versa unless there is transmission distortion.

In this work we identify a genetic difference between males and females using data collected for the purpose of comparing cases to controls in a GWAS. We show that GWAS data can be successfully re-purposed to answer new, interesting questions without incurring any genotyping cost. This is, however, only possible because the relevant phenotype information, here sex, has been collected in the original study.
Most GWAS data sets only provide a limited amount of additional phenotype information, often related to possible confounding variables such as age, sex or geographic origin of an individual. If additional variables had been recorded, then it would be possible to ask many more interesting questions by simply reanalyzing existing genotype data. Collecting additional phenotype information such as height or weight would only marginally increase the cost of the study, and would allow asking new questions, either together with the question of interest in the original GWAS, or in a completely orthogonal manner, as done in this work. For example, a recent large meta-analysis of 46 studies and 133,653 individuals has led to the identification of 180 markers associated with height [92], markedly more than the number of currently known associations for any other phenotype. Yet the number of individuals that could be used to identify associations with height would be much larger if a meta-analysis could be performed across all individuals genotyped in any GWAS so far. While privacy concerns would likely limit the feasibility of such an endeavor, this example shows the potential that could be unlocked by asking questions about genotype data that go beyond the narrow focus of an individual study.

4.4 Conclusion

In this chapter, we show that by re-purposing GWAS data, we can identify significant differences in genotype frequencies between males and females in the pseudoautosomal region located on the distal end of the short arm of chromosomes X and Y. We provide evidence showing that these differences are caused by differences in allele frequencies between the X and Y chromosomes, and we hypothesize that selective pressure acting on the sex chromosomes may explain these differences.

4.5 Methods

4.5.1 Identifying differences between males and females

For each SNP, we count the number of individuals for each of the three possible genotypes (homozygote major allele, heterozygote, homozygote minor allele) separately for each sex, resulting in a 3x2 table. We apply a Chi-square test with two degrees of freedom in order to assess whether there is a difference in genotype frequency between males and females. We apply a Bonferroni correction over the 459,075 SNPs tested in order to compute a genome-wide significance threshold of $0.05 / 459{,}075 \approx 1.1 \cdot 10^{-7}$. We then manually analyze all SNPs that reach genome-wide significance. In particular, we manually inspect the genotyping intensity plots in order to detect potential artifacts in genotype calling. Such artifacts can be caused by the homology between the sequence in the immediate vicinity of an autosomal SNP and a region of either the X or the Y chromosome. This scenario is described in more detail in Chapter 2.

We repeat this analysis for each of the disease populations separately. We then combine the P-values in order to assess the significance of the difference in the entire WTCCC data set. We use Fisher's combined probability test [68]. Given a list of $k$ P-values $p_i$ (in our case $k = 8$, since we have seven disease populations and one control population) we can compute the summary statistic $X^2 = -2 \sum_{i=1}^{k} \log_e(p_i)$. This statistic has a Chi-squared distribution with $2k$ degrees of freedom. We apply the same test to the HapMap 3 CEU individuals. As these individuals are parts of trios, we can only use the parents because the genotype of the child is not independent of the genotype of the parents.
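The tests described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the code used for the analysis: the function names are hypothetical, the example counts in the usage note are invented, and the P-values rely on the closed form of the Chi-square upper tail that holds for an even number of degrees of freedom (which covers both the 2-degree-of-freedom genotype test and the 2k-degree-of-freedom combined test).

```python
import math

# Bonferroni-corrected genome-wide threshold for 459,075 tested SNPs.
BONFERRONI_THRESHOLD = 0.05 / 459075


def chi2_upper_tail_even_df(x, df):
    """P(X >= x) for a Chi-square variable with an even number of degrees of freedom."""
    if df <= 0 or df % 2 != 0:
        raise ValueError("closed form only valid for positive even df")
    half = x / 2.0
    term = 1.0
    total = 1.0
    # exp(-x/2) * sum_{j=0}^{df/2 - 1} (x/2)^j / j!
    for j in range(1, df // 2):
        term *= half / j
        total += term
    return math.exp(-half) * total


def genotype_sex_test(male_counts, female_counts):
    """Chi-square test on a 3x2 genotype-by-sex table.

    Each argument is (hom. major, heterozygote, hom. minor) counts.
    Assumes every genotype class is observed at least once overall,
    so that the test has two degrees of freedom.
    """
    column_totals = [m + f for m, f in zip(male_counts, female_counts)]
    n = sum(column_totals)
    statistic = 0.0
    for row in (male_counts, female_counts):
        row_total = sum(row)
        for observed, column_total in zip(row, column_totals):
            expected = row_total * column_total / n
            if expected > 0:
                statistic += (observed - expected) ** 2 / expected
    return statistic, chi2_upper_tail_even_df(statistic, 2)


def fisher_combined(p_values):
    """Fisher's combined probability test over k independent P-values."""
    statistic = -2.0 * sum(math.log(p) for p in p_values)
    return statistic, chi2_upper_tail_even_df(statistic, 2 * len(p_values))
```

For instance, `genotype_sex_test((900, 500, 100), (700, 650, 150))` returns the statistic and P-value for one hypothetical cohort, and `fisher_combined` can then be applied to the eight per-cohort P-values before comparing the result to `BONFERRONI_THRESHOLD`.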

4.5.2 Trio-based phasing in PARs

The HapMap 3 data set consists of genotype information for trios (mother, father, child). We know that for all female individuals, both alleles are present on X chromosomes. We are interested in determining the phasing of male individuals in order to know whether an allele is present on the X or the Y chromosome. We can use trios in order to determine the exact phasing of a male child in most cases using the information of individual SNPs only. If the child is homozygous, then the phasing is trivial since both the X and the Y chromosome contain the same allele. If the child is heterozygous and at least one parent is homozygous, then we know which allele this parent passed on to the child. If the mother is homozygous, then the allele the mother is homozygous for must be on the X chromosome of the child. If the father is homozygous, then the allele the father is homozygous for must be on the Y chromosome of the child. If all three individuals are heterozygous, then the exact phasing cannot be determined using the information provided by the single SNP alone.

Furthermore, if we can determine which allele the father passed on to the male child, then we also know which allele did not get transmitted. It is important to note that this allele is not necessarily on the X chromosome in the somatic cells of the father if the father is heterozygous. During meiosis I, the sex chromosomes undergo recombination in the PARs. The chromosome passed on to the child is the result of this recombination, and at each SNP, one allele is passed on to the child. The other allele ends up on a chromosome that becomes part of another gamete. Therefore, if the physical location of the recombination event is between a given SNP and the PAB, then an allele that was on the X chromosome of the father will end up on the Y chromosome of the male child. In this case, we know that there was an X chromosome in a gamete that contained the allele not passed on to the child, but it would be incorrect to conclude that the allele that was not passed on to the child is on the X chromosome of the father. Such a situation is, however, unlikely for the specific case of rs312258, which is located only 40 kilobases from the PAB, whereas PAR1 is 2.6 megabases long. Therefore, in most cases the allele that is not transmitted from the father is on the X chromosome of the father. Furthermore, even if a recombination event happened and the non-transmitted allele was on the Y chromosome of the father, there is no specific reason to believe that the gamete which contained the allele on the X chromosome could not have led to a viable individual, especially given the high minor allele frequency in both sexes. We choose to use the information obtained from the non-transmitted allele of the father as well. For each trio in which the child is a male and at least one individual is homozygous, we can therefore identify which three independent alleles are on X chromosomes and which allele is on a Y chromosome.

If the child in the trio is a female, then we know that both alleles are on X chromosomes. We can, however, apply an approach similar to the one used for male children in order to identify which allele the father passed on to the daughter on the X chromosome, and which allele ended up on a Y chromosome in a gamete.
We can therefore also identify which three independent alleles are on X chromosomes and which allele is on a Y chromosome unless all three individuals are heterozygous. Table 4.6 provides an overview of all possible combinations of genotypes, together with the four alleles we can assign to an X or a Y chromosome in each case.
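The single-SNP phasing rules described in this section can be written as a small function. The sketch below is illustrative only: the names `Phase` and `phase_trio` are hypothetical, genotypes are assumed to be two-character strings such as "AT", and only basic Mendelian consistency is checked. For a male child the paternal allele is the one on Y and the father's untransmitted allele is the one assigned to an X gamete; for a female child the roles of X and Y in the father are swapped.

```python
from typing import NamedTuple, Optional


class Phase(NamedTuple):
    child_maternal: str        # allele on the child's maternal X
    child_paternal: str        # allele inherited from the father (Y for sons, X for daughters)
    mother_untransmitted: str  # allele on the mother's other X
    father_untransmitted: str  # allele in the father's other gamete


def _untransmitted(genotype, transmitted):
    """Return the allele of `genotype` that was not passed on."""
    alleles = list(genotype)
    if transmitted not in alleles:
        raise ValueError("genotypes are not Mendelian-consistent")
    alleles.remove(transmitted)
    return alleles[0]


def phase_trio(mother, father, child) -> Optional[Phase]:
    """Phase one SNP in a trio; returns None when all three are heterozygous."""
    first, second = child
    if first == second:
        # Homozygous child: both parents transmitted the same allele.
        maternal = paternal = first
    elif mother[0] == mother[1]:
        # Homozygous mother: her allele is on the child's maternal X.
        maternal = mother[0]
        if maternal not in child:
            raise ValueError("genotypes are not Mendelian-consistent")
        paternal = second if maternal == first else first
    elif father[0] == father[1]:
        # Homozygous father: his allele is the paternally inherited one.
        paternal = father[0]
        if paternal not in child:
            raise ValueError("genotypes are not Mendelian-consistent")
        maternal = second if paternal == first else first
    else:
        return None
    return Phase(
        maternal,
        paternal,
        _untransmitted(mother, maternal),
        _untransmitted(father, paternal),
    )
```

Note that this sketch does not model the starred cases of Table 4.6: when the father is heterozygous, his untransmitted allele may sit on the other sex chromosome in his somatic cells if recombination occurred between the SNP and the PAB.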

Mother  Father  Male Child  Child X  Child Y  Mother other X  Father Gamete X
AA      TT      AT          A        T        A               T
AA      AT      AT          A        T        A               A*
AT      TT      AT          A        T        T               T
AT      AT      TT          T        T        A               A*
TT      AT      TT          T        T        T               A*
AT      TT      TT          T        T        A               T
AT      AT      AT          ?        ?        ?               ?

Mother  Father  Female Child  Child Xm  Child Xp  Mother other X  Father Gamete Y
AA      TT      AT            A         T         A               T
AA      AT      AT            A         T         A               A*
AT      TT      AT            A         T         T               T
AT      AT      TT            T         T         A               A*
TT      AT      TT            T         T         T               A*
AT      TT      TT            T         T         A               T
AT      AT      AT            ?         ?         ?               ?

Table 4.6: Phasing cases in trios

Xm and Xp represent the alleles passed on to the daughter on the X chromosome by the mother and the father, respectively. A star next to the father gamete indicates cases in which the allele is on the other sex chromosome in somatic cells of the father if the recombination event happened between the SNP and the PAB. Cases that are symmetrical (by swapping the A and T alleles) are not presented. In the cases where all individuals are heterozygous, the only inference that can be made is that all alleles in females are on X chromosomes.

Chapter 5

Integrating Regulatory Information

5.1 Introduction

Although Genome-wide Association Studies (GWAS) provide a list of SNPs that are statistically associated with a phenotype of interest, they do not offer any direct evidence about the biological processes that link the associated variant to the phenotype. A major challenge in the interpretation of GWAS results comes from the fact that most detected associations point to larger regions of correlated variants. SNPs that are located in close proximity in the genome tend to be in linkage disequilibrium (LD) with each other [6, 7], and only a few SNPs per linkage disequilibrium region are measured on a given genotyping platform. Regions of strong linkage disequilibrium can be large, and SNPs associated with a phenotype have been found to be in perfect linkage disequilibrium with SNPs several hundred kilobases away. Although sequencing can be used to assess associated regions more precisely [93], using sequence information alone is insufficient to distinguish among SNPs that are in perfect linkage disequilibrium with each other in the studied population, and thus equally associated with the phenotype.

Various approaches have been developed to identify variants that are likely to play an important biological role. Most of these approaches focus on the interpretation of coding or other SNPs in transcribed regions [94, 95, 96]. The vast majority of associated SNPs identified in GWAS, however, are in non-transcribed regions, and it is likely that the underlying mechanism linking them to the phenotype is regulatory. SNPs that influence gene expression (Expression Quantitative Trait Loci, eQTLs) [97, 98] have been shown to be significantly enriched for GWAS associations [99, 100]. Although eQTLs can be used to identify the downstream targets that are likely to be affected by associations identified in a GWAS, they are still based on genotyping methods, and therefore also point to regions of linkage disequilibrium rather than to individual SNPs. Methods for identifying SNPs that overlap regulatory elements, such as transcription factor binding sites, are therefore necessary. Approaches based on known transcription factor binding motifs [101, 102] have been successfully used to refine GWAS results and identify specific loci that have a functional role [103, 104]. However, the presence of a motif does not imply that a transcription factor is necessarily binding in vivo.

High-throughput functional assays such as chromatin immunoprecipitation assays followed by sequencing (ChIP-seq) [105, 106] and DNaseI hypersensitive site [107] identification by sequencing (DNase-seq) [108, 109] can experimentally detect functional regions such as transcription factor binding sites. Experimental evidence shows that the presence of SNPs in these regions leads to differences in transcription factor binding between individuals [110]. A SNP that overlaps an experimentally detected transcription factor binding site and is in strong linkage disequilibrium with a SNP associated with a phenotype is thus more likely to play a biological role than other SNPs in the associated region for which there is no evidence of overlap with any functional data. Several recent analyses of associated regions use these types of functional data in order to identify functional loci in individual diseases [111, 112, 113, 114].
A recent study of chromatin marks in nine different cell lines produced a genome-wide map of regulatory elements, and showed a two-fold enrichment for predicted enhancers amongst the associated SNPs from GWAS [115]. These examples illustrate the power of combining statistical associations between a region of the genome and a phenotype together with functional data in order to generate hypotheses about the mechanism underlying the association.

The main goal of the Encyclopedia of DNA Elements (ENCODE) project is to identify all functional elements in the human genome, including coding and non-coding transcripts, marks of accessible chromatin and protein binding sites [116, 117, 118]. The datasets generated by the ENCODE consortium are therefore particularly well suited for the functional interpretation of GWAS results. To date, a total of 147 different cell types have been studied using a wide variety of experimental assays [119]. Chromatin accessibility has been studied using DNase-seq, which led to the identification of 2.89 million DNaseI hypersensitive sites that may exhibit regulatory function. DNase footprinting [120, 121, 122] was used to detect binding between proteins and the genome at nucleotide resolution. ChIP-seq experiments were conducted for a total of 119 transcription factors and other DNA-binding proteins. Together these data provide a rich source of information that can be used to associate GWAS annotations with functional data.

In this work, we show that data generated by the ENCODE consortium can be successfully used to functionally annotate associations previously identified in genome-wide association studies. We combine multiple sources of evidence in order to identify SNPs that are located in a functional region of the genome and are associated with a phenotype. We show that a majority of known GWAS associations overlap a functional region, or are in strong linkage disequilibrium with a SNP overlapping a functional region. We find that for a majority of associations, the SNP whose functional role is most strongly supported by ENCODE data is a SNP in linkage disequilibrium with the reported SNP, not the genotyped SNP reported in the association study. We show that there is significant overall enrichment for regulatory function in disease associated regions, and that combining multiple sources of evidence leads to stronger enrichment.

We use information from RegulomeDB [123], a database designed for fast annotation of SNPs that combines ENCODE datasets (ChIP-seq peaks, DNaseI hypersensitivity peaks, DNaseI footprints) with additional data sources (ChIP-seq data from the NCBI Sequence Read Archive, conserved motifs, eQTLs and experimentally validated functional SNPs). Using these publicly available resources makes the approach presented herein easily applicable to the analysis of any future GWAS.

5.2 Results

We use linkage disequilibrium information in order to integrate GWAS results with ENCODE data and eQTLs. We call any SNP that appears in a region identified as associated with a biochemical event in at least one ENCODE cell line a functional SNP. Functional SNPs can be further subdivided into SNPs that overlap coding or non-coding transcripts, and SNPs that appear in a region identified as potentially regulatory, such as ChIP-seq peaks and DNaseI hypersensitive sites. We call the SNPs that are reported to be statistically associated with a phenotype lead SNPs. For each lead SNP we first determine whether the lead SNP itself is a functional SNP, then find all functional SNPs that are in strong linkage disequilibrium with the lead SNP. We integrate eQTL information in a similar way, by checking whether the lead SNP or a SNP in strong linkage disequilibrium with the lead SNP has been associated with a change in gene expression. Figure 5.1 illustrates our approach by describing a scenario in which a lead SNP is in strong linkage disequilibrium with a functional SNP that overlaps a transcription factor binding site, as well as with a third SNP that is an eQTL. If neither the lead SNP nor the eQTL SNP overlaps a functional region, then the functional SNP is more likely to be the SNP that plays a biological role in the phenotype than either of the SNPs that were genotyped. An extreme example would be the case where all three SNPs are in perfect linkage disequilibrium, but only the associated SNP was present on the genotyping platform used in the GWAS in which the association was found, and only the eQTL SNP was present on the genotyping platform used in the eQTL study. In this scenario, the functional SNP would be associated as strongly with the disease and with the change in gene expression as the reported association and eQTL SNPs.
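The annotation procedure just described can be sketched as follows. This is a minimal illustration, not the pipeline used in this work: the function name, the dictionary-based representation of pairwise r2 values and the output fields are all assumptions made for the example, and a single r2 value per SNP pair stands in for the population-specific LD information.

```python
def annotate_lead_snps(lead_snps, pairwise_r2, functional_snps, eqtl_snps, threshold=0.8):
    """Annotate each lead SNP with functional and eQTL SNPs in strong LD.

    pairwise_r2 maps an (snp_a, snp_b) tuple to the r2 between the two SNPs.
    functional_snps and eqtl_snps are sets of SNP identifiers.
    """
    # Build a symmetric lookup of strong-LD partners for every SNP.
    partners = {}
    for (snp_a, snp_b), r2 in pairwise_r2.items():
        if r2 >= threshold:
            partners.setdefault(snp_a, set()).add(snp_b)
            partners.setdefault(snp_b, set()).add(snp_a)

    annotations = {}
    for lead in lead_snps:
        in_ld = partners.get(lead, set())
        annotations[lead] = {
            "lead_is_functional": lead in functional_snps,
            "lead_is_eqtl": lead in eqtl_snps,
            "functional_in_ld": sorted(in_ld & functional_snps),
            "eqtl_in_ld": sorted(in_ld & eqtl_snps),
        }
    return annotations
```

Applied to the scenario of Figure 5.1, with SNP1 as the lead SNP, SNP3 as the eQTL and SNP6 as the functional SNP, all in perfect LD, the function reports SNP6 under `functional_in_ld` and SNP3 under `eqtl_in_ld` for SNP1.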
In order to show the potential of this approach, we analyze a set of 5694 curated associations from the NHGRI GWAS catalog [11] that represent a total of 4724 distinct SNPs associated with a total of 470 different phenotypes (see Section 5.6 for details).

Figure 5.1: Schematic overview of the functional SNP approach. This figure illustrates the approach we use to identify functional SNPs. Three different types of regulatory data are represented for an area of the genome: motif-based predictions, DNaseI hypersensitivity peaks, and ChIP-seq peaks. This region contains six SNPs. SNP1 is associated with a phenotype in a Genome Wide Association Study. SNP3 is an eQTL associated with changes in gene expression in a different study. SNP6 overlaps a predicted motif, a DNaseI hypersensitivity peak and a ChIP-seq peak. There are therefore multiple sources of evidence that SNP6 is in a regulatory region. Furthermore, SNP6 is in perfect linkage disequilibrium (r2 = 1.0) with SNP1 and SNP3, meaning that there is transitive evidence due to the LD that SNP6 is also associated with the phenotype and is also an eQTL. SNP6 is therefore the most likely functional SNP in this associated region.

5.2.1 Lead SNP annotation

We first annotated each lead SNP with transcription information from GENCODE v7 and regulatory information from RegulomeDB. Overall, 44.8% of all lead SNPs overlap with some ENCODE data, making them functional SNPs according to our definition, and 13.1% of the lead SNPs are supported by more than one type of functional evidence. Specifically, 223 lead SNPs (4.7%) overlap coding regions, 146 (3.1%) overlap with the non-coding part of an exon, 1714 (36.3%) overlap with a DNaseI peak in at least one cell line, 355 (7.5%) overlap with a DNaseI footprint, and 938 (19.9%) overlap with a ChIP-seq peak for at least one of the assessed proteins in at least one cell line. Figure 5.2 shows the fraction of lead SNPs supported by different sources of evidence. Thus, we find that many GWAS SNPs overlap ENCODE data.

5.2.2 Linkage disequilibrium

For each lead SNP we next located the set of SNPs that are in strong linkage disequilibrium (r2 ≥ 0.8) with the lead SNP in all four HapMap 2 populations, and annotated each SNP in this set. As expected, the fraction of lead SNPs in strong linkage disequilibrium with a SNP overlapping each type of functional evidence is larger than when considering lead SNPs alone (Figure 5.2), and 58% of all associations are in strong linkage disequilibrium with at least one functional SNP. A similar increase can be observed for functional SNPs supported by multiple sources of evidence. A detailed breakdown for each type of functional evidence for multiple linkage disequilibrium thresholds is provided in Table 5.1. We repeated the same analysis for the 2464 lead SNPs that have been associated with a phenotype in a population of European descent, using SNPs in strong linkage disequilibrium (r2 ≥ 0.8) with the lead SNP in the European HapMap population only. A total of 81% of the lead SNPs are in strong LD with at least one functional SNP, and 59% of the associated SNPs are in strong linkage disequilibrium with a functional SNP supported by multiple sources of evidence. A detailed breakdown is provided in Table 5.2.

[Figure 5.2: stacked bar charts with a horizontal axis from 0% to 100% and three rows: lead SNP, r2 ≥ 0.8 in all populations, r2 ≥ 0.8 in CEU. Panel A categories: coding; transcribed, non-coding; DNaseI peak; DNaseI footprint; ChIP-seq peak. Panel B categories: ChIP-seq and DNaseI peak with matched motif or footprint; ChIP-seq peak and DNaseI peak and motif; ChIP-seq and DNaseI peak; ChIP-seq peak only; DNaseI peak only; motif only (no experimental support); no annotation.]

Figure 5.2: Proportions of associations for different types of functional data. Proportions are shown for individual assays (A.), and for all sources of evidence combined (B.). Proportions are presented separately for lead SNPs and SNPs in strong linkage disequilibrium (r2 ≥ 0.8) with a lead SNP. For each association we determine which SNP in the LD region is most strongly supported by functional data in order to generate the proportions in panel B. We separately consider SNPs in strong linkage disequilibrium with a lead SNP in all HapMap 2 populations, and SNPs in strong linkage disequilibrium with a lead SNP in the CEU population. For the latter case, we use only associations identified in populations of European descent, and show that we can map 80% of these associations to a functional SNP supported by experimental ENCODE data.

                                     Lead           Perfect LD     r2 ≥ 0.9       r2 ≥ 0.8
                                     Count     %    Count     %    Count     %    Count     %
Total                                 4724           4724           4724           4724
Predicted motif                       1335 28.3%    1656 35.1%    1947 41.2%    2181 46.2%
DNaseI hypersensitivity peak          1714 36.3%    1979 41.9%    2189 46.3%    2374 50.3%
DNaseI footprint                       355  7.5%     471 10.0%     600 12.7%     718 15.2%
ChIP-seq peak                          938 19.9%    1144 24.2%    1318 27.9%    1491 31.6%
Coding                                 223  4.7%     268  5.7%     306  6.5%     345  7.3%
Exon, non-coding                       146  3.1%     165  3.5%     202  4.3%     232  4.9%
RegulomeDB Score 2 (total)             126  2.7%     165  3.5%     197  4.2%     248  5.2%
RegulomeDB Score 2a                     14  0.3%      16  0.3%      21  0.4%      26  0.6%
RegulomeDB Score 2b                    109  2.3%     145  3.1%     169  3.6%     212  4.5%
RegulomeDB Score 2c                      3  0.1%       4  0.1%       7  0.1%      10  0.2%
RegulomeDB Score 3 (total)              58  1.2%      75  1.6%      84  1.8%      89  1.9%
RegulomeDB Score 3a                     55  1.2%      69  1.5%      78  1.7%      83  1.8%
RegulomeDB Score 3b                      3  0.1%       6  0.1%       6  0.1%       6  0.1%
RegulomeDB Score 4                     436  9.2%     492 10.4%     538 11.4%     571 12.1%
RegulomeDB Score 5 (total)            1110 23.5%    1192 25.2%    1240 26.2%    1254 26.5%
RegulomeDB Score 5a                    218  4.6%     246  5.2%     270  5.7%     289  6.1%
RegulomeDB Score 5b                    892 18.9%     946 20.0%     970 20.5%     965 20.4%
RegulomeDB Score 6                     933 19.8%     908 19.2%     873 18.5%     838 17.7%
RegulomeDB Score 7                    1692 35.8%    1459 30.9%    1284 27.2%    1147 24.3%
eQTL                                   462  9.8%     514 10.9%     556 11.8%     597 12.6%
eQTL + Predicted motif                 113  2.4%     206  4.4%     275  5.8%     329  7.0%
eQTL + DNaseI hypersensitivity peak    201  4.3%     300  6.4%     359  7.6%     413  8.7%
eQTL + DNaseI footprint                 40  0.8%      89  1.9%     129  2.7%     165  3.5%
eQTL + ChIP-seq peak                   118  2.5%     197  4.2%     247  5.2%     306  6.5%
eQTL + RegulomeDB Score 2 (total)       17  0.4%      28  0.6%      34  0.7%      45  1.0%
eQTL + RegulomeDB Score 2a               1  0.0%       1  0.0%       3  0.1%       6  0.1%
eQTL + RegulomeDB Score 2b              16  0.3%      27  0.6%      30  0.6%      38  0.8%
eQTL + RegulomeDB Score 2c               0  0.0%       0  0.0%       1  0.0%       1  0.0%
eQTL + RegulomeDB Score 3 (total)        5  0.1%      11  0.2%      12  0.3%      14  0.3%
eQTL + RegulomeDB Score 3a               4  0.1%      10  0.2%      11  0.2%      13  0.3%
eQTL + RegulomeDB Score 3b               1  0.0%       1  0.0%       1  0.0%       1  0.0%
eQTL + RegulomeDB Score 4               56  1.2%      77  1.6%      87  1.8%      94  2.0%
eQTL + RegulomeDB Score 5 (total)      117  2.5%     131  2.8%     137  2.9%     130  2.8%
eQTL + RegulomeDB Score 5a              27  0.6%      30  0.6%      33  0.7%      36  0.8%
eQTL + RegulomeDB Score 5b              90  1.9%     101  2.1%     104  2.2%      94  2.0%
eQTL + RegulomeDB Score 6              200  4.2%     159  3.4%     144  3.0%     136  2.9%

Table 5.1: Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds. Only functional SNPs that are in linkage disequilibrium at or above the indicated threshold in all HapMap 2 populations are used.

Lead Perfect LD r2 ≥ 0.9 r2 ≥ 0.8 Count Count Count Count Total 2461 2461 2461 2461 Predicted motif 729 29.6% 1470 59.7% 1678 68.2% 1846 75.0% DNaseI hypersensitivity peak 959 39.0% 1527 62.0% 1709 69.4% 1846 75.0% DNaseI footprint 213 8.7% 614 24.9% 755 30.7% 935 38.0% ChIP-seq peak 527 21.4% 1107 45.0% 1310 53.2% 1492 60.6% Coding 135 5.5% 266 10.8% 333 13.5% 396 16.1% Exon, non-coding 86 3.5% 186 7.6% 223 9.1% 276 11.2% RegulomeDB Score 2 (total) 84 3.4% 204 8.3% 240 9.8% 285 11.6% RegulomeDB Score 2a 9 0.4% 23 0.9% 23 0.9% 31 1.3% RegulomeDB Score 2b 73 3.0% 174 7.1% 207 8.4% 245 10.0% RegulomeDB Score 2c 2 0.1% 7 0.3% 10 0.4% 9 0.4% RegulomeDB Score 3 (total) 28 1.1% 67 2.7% 87 3.5% 92 3.7% RegulomeDB Score 3a 27 1.1% 65 2.6% 85 3.5% 90 3.7% RegulomeDB Score 3b 1 0.0% 2 0.1% 2 0.1% 2 0.1% RegulomeDB Score 4 245 10.0% 367 14.9% 392 15.9% 397 16.1% RegulomeDB Score 5 (total) 579 23.5% 609 24.7% 597 24.3% 544 22.1% RegulomeDB Score 5a 107 4.3% 175 7.1% 187 7.6% 175 7.1% RegulomeDB Score 5b 472 19.2% 434 17.6% 410 16.7% 369 15.0% RegulomeDB Score 6 495 20.1% 382 15.5% 310 12.6% 263 10.7% RegulomeDB Score 7 0 0.0% 380 15.4% 279 11.3% 208 8.5% eQTL 244 9.9% 373 15.2% 419 17.0% 483 19.6% eQTL + Predicted motif 68 2.8% 270 11.0% 333 13.5% 409 16.6% eQTL + DNaseI hypersensitivity peak 112 4.6% 294 11.9% 361 14.7% 439 17.8% eQTL + DNaseI footprint 30 1.2% 168 6.8% 216 8.8% 288 11.7% eQTL + ChIP-seq peak 64 2.6% 234 9.5% 305 12.4% 390 15.8% eQTL + RegulomeDB Score 2 (total) 14 0.6% 32 1.3% 33 1.3% 41 1.7% eQTL + RegulomeDB Score 2a 0 0.0% 8 0.3% 6 0.2% 5 0.2% eQTL + RegulomeDB Score 2b 14 0.6% 23 0.9% 27 1.1% 36 1.5% eQTL + RegulomeDB Score 2c 0 0.0% 1 0.0% 0 0.0% 1 0.0% eQTL + RegulomeDB Score 3 (total) 2 0.1% 12 0.5% 19 0.8% 23 0.9% eQTL + RegulomeDB Score 3a 2 0.1% 11 0.4% 18 0.7% 23 0.9% eQTL + RegulomeDB Score 3b 0 0.0% 1 0.0% 1 0.0% 0 0.0% eQTL + RegulomeDB Score 4 27 1.1% 46 1.9% 55 2.2% 59 2.4% eQTL + RegulomeDB Score 5 (total) 60 2.4% 68 2.8% 64 2.6% 43 1.7% 
eQTL + RegulomeDB Score 5a 13 0.5% 18 0.7% 21 0.9% 12 0.5% eQTL + RegulomeDB Score 5b 47 1.9% 50 2.0% 43 1.7% 31 1.3% eQTL + RegulomeDB Score 6 104 4.2% 56 2.3% 39 1.6% 31 1.3%

Table 5.2: Fraction of associations overlapping functional regions for different linkage disequilibrium thresholds (European populations). Functional SNPs that are in linkage disequilibrium with the lead SNP at or above the indicated threshold in the HapMap 2 CEU population are used. Only associations that were identified or replicated in populations of European descent are used. CHAPTER 5. INTEGRATING REGULATORY INFORMATION 69

5.2.3 Integrating gene expression data

We integrated data from multiple eQTL studies that identified SNPs associated with changes in gene expression in several tissues. A total of 462 lead SNPs (9.8%) are themselves an eQTL in at least one tissue, and an additional 135 lead SNPs (2.8%) are in strong LD (r² ≥ 0.8 in all HapMap 2 populations) with an eQTL. When considering only associations in populations of European descent, 483 lead SNPs (19.6%) are either an eQTL or in strong LD with an eQTL. Amongst lead SNPs that are also eQTLs, the fractions that overlap DNaseI peaks (201, 43.5%) and ChIP-seq peaks (118, 25.5%) are significantly higher than amongst all lead SNPs (P-values of 7.6 × 10^-4 and 1.7 × 10^-3, respectively).

5.2.4 SNP comparison within linkage disequilibrium regions

ENCODE data can be used to compare multiple functional SNPs that are in LD with a given lead SNP. We used a two-step approach to compare the functional annotation of two SNPs. First, if one of the SNPs is in a coding region according to GENCODE v7 and the other one is not, the coding SNP is considered more likely to be functional. Similarly, a SNP in a non-coding part of an exon is considered more likely to be functional than a SNP in an intergenic region or an intron. Second, if neither SNP is in an exon, we compared the amount of evidence across data sources supporting the functional role of each SNP using a scoring scheme integrated in RegulomeDB (see Section 5.5.5). We hypothesized that a SNP supported by multiple types of evidence (e.g. a ChIP-seq peak and a DNaseI footprint) is more likely to be functional than a SNP supported by a single experimental modality. We find that for most associations where the lead SNP is in LD with at least one other SNP, the most strongly supported functional SNP is not the lead SNP itself, but another SNP in the LD region (22.4% compared to 13.6% when using LD in all populations, 56.8% compared to 13.6% when considering CEU only, Table 5.3). These results show that in most cases, the associated SNP reported in a GWAS is not the most likely to play a biological role in the phenotype according to ENCODE data.
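The two-step comparison described above can be sketched as follows. This is a minimal illustration and not the implementation used in this work: the `SnpAnnotation` record, the function names and the example identifiers are hypothetical, and the regulatory evidence is assumed to be summarized as a single integer category in which a lower value means more supporting evidence (the actual scoring scheme is described in Section 5.5.5).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SnpAnnotation:
    """Functional annotation of one SNP (hypothetical record layout)."""
    rsid: str
    coding: bool            # overlaps a coding region
    noncoding_exon: bool    # overlaps a non-coding part of an exon
    regulatory_score: int   # assumed integer category, lower = more evidence


def transcript_rank(snp: SnpAnnotation) -> int:
    """Step 1 ordering: coding (0), non-coding exon (1), not in an exon (2)."""
    if snp.coding:
        return 0
    if snp.noncoding_exon:
        return 1
    return 2


def compare_snps(lead: SnpAnnotation, other: SnpAnnotation) -> str:
    """Return 'lead', 'other' or 'equal' depending on which SNP is better supported."""
    lead_rank, other_rank = transcript_rank(lead), transcript_rank(other)
    # Step 1: transcribed annotation decides whenever the two SNPs differ.
    if lead_rank != other_rank:
        return "lead" if lead_rank < other_rank else "other"
    if lead_rank < 2:
        return "equal"  # both coding, or both in a non-coding part of an exon
    # Step 2: neither SNP is in an exon, so compare regulatory evidence.
    if lead.regulatory_score != other.regulatory_score:
        return "lead" if lead.regulatory_score < other.regulatory_score else "other"
    return "equal"


# Made-up annotations, for illustration only.
lead = SnpAnnotation("rs_lead_example", False, False, 5)
proxy = SnpAnnotation("rs_proxy_example", False, False, 2)
coding = SnpAnnotation("rs_coding_example", True, False, 7)
result_regulatory = compare_snps(lead, proxy)
result_coding = compare_snps(coding, proxy)
```

In this sketch the transcribed annotation always takes precedence, so a coding SNP is preferred over a non-exonic SNP regardless of the regulatory evidence for the latter.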

                                                             All populations  CEU only
                                                             Count        %   Count        %
Only lead SNP coding                                           199    4.21%      87    3.53%
Only lead SNP transcribed, non-coding                          113    2.39%      39    1.58%
Lead SNP supported by more regulatory evidence                 329    6.96%     208    8.44%
Lead better                                                    641   13.56%     334   13.56%
Lead SNP and SNP in LD coding                                   24    0.51%      48    1.95%
Lead SNP and SNP in LD transcribed, non-coding                  21    0.44%      30    1.22%
Lead SNP and SNP in LD have similar regulatory evidence        282    5.97%     193    7.83%
Lead and SNP in LD equal                                       327    6.92%     271   11.00%
Lead SNP transcribed, non-coding, SNP in LD coding              12    0.25%      17    0.69%
Lead SNP not transcribed, SNP in LD coding                     110    2.33%     244    9.90%
Lead SNP not transcribed, SNP in LD transcribed, non-coding     98    2.07%     207    8.40%
SNP in LD supported by more regulatory evidence                356    7.53%     456   18.51%
SNP in LD annotated, lead SNP not annotated                    483   10.22%     476   19.32%
SNP in LD better                                              1059   22.40%    1400   56.82%
No annotation                                                 1147   24.26%     208    8.44%
Lead SNP annotated, no SNP in LD                              1553   32.85%     251   10.19%

Table 5.3: Comparison of functional evidence between the lead SNP and the best SNP in the linkage disequilibrium region. The lines "Lead better", "Lead and SNP in LD equal" and "SNP in LD better" represent totals for each case. When considering a linkage disequilibrium threshold in the CEU population alone, only associations that were identified or replicated in populations of European descent are used.

5.2.5 Associations are enriched for regulatory elements

[Figure 5.3: bar chart of fold enrichment (vertical axis) for two groups of bars, all lead SNPs and lead SNPs that are also eQTLs. Legend: eQTL, predicted motif, DNaseI hypersensitivity peak, DNaseI footprint, ChIP-seq peak, and ChIP-seq peak and DNaseI peak and motif.]

Figure 5.3: Enrichment for different combinations of assays.

Enrichments are reported for all lead SNPs associated with a phenotype, and separately for lead SNPs that are also eQTLs or in strong linkage disequilibrium with an eQTL. The enrichment for predicted motifs alone (italic) is not significant. These results show that combining multiple types of experimental evidence increases the observed enrichment.

We performed randomizations in order to compare the fraction of lead SNPs that are functional SNPs or are in linkage disequilibrium with a functional SNP, to the expected fraction amongst all SNPs. We found that associated regions are significantly enriched for functional SNPs identified using DNase-seq and ChIP-seq. Furthermore, enrichments increased, both when integrating multiple ENCODE assays and when adding eQTL information. We used a subset of 2364 lead SNPs for which sufficient information is available, and built 100 random matched SNP sets in which each lead SNP is replaced by a similar SNP (see Section 5.6 for details). We compared the fraction of lead SNPs overlapping functional regions in the set of actual lead SNPs to the fractions observed in the random sets, and computed enrichment values in order to show that the fraction of associated SNPs that overlap functional regions is higher than expected. Figure 5.3 provides an overview of the enrichment for different types of functional data.
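As an illustration of how a fold enrichment relates the observed fraction to the fractions in the matched random sets, the following sketch divides the observed fraction by the mean fraction across null sets. It is a simplified reading of the procedure, not the implementation used here: the function name is hypothetical, the five null counts are made-up values chosen so that their mean matches the expected fraction of 30.5% reported for DNaseI hypersensitivity peaks in Table 5.4, and the computation of P-values (described in Section 5.6) is not reproduced.

```python
from statistics import mean


def fold_enrichment(observed_hits, n_snps, null_hits):
    """Return (observed fraction, expected fraction, fold enrichment).

    observed_hits: number of lead SNPs overlapping the functional annotation
    n_snps:        number of lead SNPs (each null set has the same size)
    null_hits:     overlap counts in the matched random SNP sets
    """
    observed_fraction = observed_hits / n_snps
    expected_fraction = mean(hits / n_snps for hits in null_hits)
    return observed_fraction, expected_fraction, observed_fraction / expected_fraction


# 810 of 2364 lead SNPs overlap a DNaseI hypersensitivity peak (Table 5.4).
# The null counts below are illustrative only; the analysis uses 100 sets.
illustrative_null_hits = [715, 727, 721, 718, 724]
observed, expected, enrichment = fold_enrichment(810, 2364, illustrative_null_hits)
print(f"observed {observed:.1%}, expected {expected:.1%}, enrichment {enrichment:.2f}")
```

Because every matched set has the same size as the set of lead SNPs, the ratio of fractions is equal to the ratio of the observed count to the mean null count.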

When considering lead SNPs only, we observed a 1.12-fold enrichment for DNase peaks, a 1.22-fold enrichment for DNase footprints and a 1.25-fold enrichment for ChIP-seq peaks. All enrichments are statistically significant (P-values of 1.3 · 10^-4, 0.005 and 1.3 · 10^-6, respectively). We also observed that combining multiple types of evidence increases the enrichment: there is a 1.36-fold enrichment for lead SNPs that overlap with a ChIP-seq peak, a DNase peak, a DNase footprint and a predicted motif. Similarly, there is a 1.33-fold enrichment for eQTLs, and an even higher enrichment for eQTLs that also overlap functional regions (up to 2.4-fold). These enrichments can be compared to the 1.05-fold enrichment (not significant, P-value 0.087) observed when considering overlap with motif-based predictions, which do not make use of ENCODE data. When extending the set of possible functional SNPs to SNPs that are in linkage disequilibrium with a lead SNP, we observed a decrease in the enrichment (Figure 5.4A,B). At an r² LD threshold of 0.8, enrichments for most individual modalities are barely significant, but the enrichment for functional SNPs supported by multiple sources of evidence remains significant.
Limiting the set of lead SNPs to the most strongly supported associations (replication in a different cohort in the original study or in multiple studies) leads to an increase in enrichment (Figure 5.4C). A total of 1216 lead SNPs were curated from an original study that included a separate replication population, and 478 of them (39.3%) overlap DNaseI peaks. A total of 166 lead SNPs were replicated in a different study, and 85 of them (51.2%) overlap a DNaseI peak. A similar trend can be observed for ChIP-seq peaks.
The enrichments for different types of functional evidence and multiple linkage disequilibrium thresholds, when considering only functional SNPs that are in linkage disequilibrium with the lead SNP at or above the indicated threshold in all HapMap 2 populations, are presented in detail in Tables 5.4 (lead SNPs), 5.5 (perfect linkage disequilibrium, r² = 1.0), 5.6 (r² ≥ 0.9), 5.7 (r² ≥ 0.8) and 5.8 (r² ≥ 0.5). Entries with significant enrichment (P-value < 0.05) are represented in bold. SNPs are filtered as described in Section 5.6.3, resulting in smaller sets than in Table 5.1. We separately consider associations that were identified or replicated in populations of European descent, together with functional SNPs that are in linkage disequilibrium with the lead SNP at or above the indicated threshold in the HapMap 2 CEU population, and provide detailed enrichments in Tables 5.9 (lead SNPs), 5.10 (perfect linkage disequilibrium, r² = 1.0), 5.11 (r² ≥ 0.9), 5.12 (r² ≥ 0.8) and 5.13 (r² ≥ 0.5).

[Figure 5.4, three panels. A. Enrichment for different LD thresholds: percentage of associations (vertical axis) for the lead SNP, perfect LD and decreasing LD thresholds (horizontal axis), with observed and expected curves for DNaseI peaks, ChIP-seq peaks and DNaseI footprints. B. Enrichments for DNaseI peaks. C. Enrichments for ChIP-seq peaks.]

Figure 5.4: Overview of enrichment.

(A.) Percentage of associated SNPs mapped to a functional SNP overlapping DNaseI peaks, DNaseI footprints and ChIP-seq peaks (full lines) compared to expected percentages in the matched null sets (dotted line) for various linkage disequilibrium thresholds. As the LD threshold decreases, the fraction of associations that can be mapped to functional SNPs increases, but the enrichment for functional SNPs amongst associated SNPs decreases. Comparison of the fraction of SNPs overlapping DNaseI (B.) and ChIP-seq peaks (C.) in various null sets of matched random SNPs (blue) and sets of associated SNPs (green). The fraction of random SNPs overlapping DNaseI peaks and ChIP-seq peaks increases when properties of associated SNPs are matched more closely, and when considering associations supported by more evidence. The red arrows indicate the most stringent matched null set, and the set of all associations. We compare those two sets in order to show that enrichments are significant.

                                     Observed         Exp.  Enrichment  P-value
Total                                 2364
Predicted motif                        688   29.1%   27.7%        1.05  0.087
DNaseI hypersensitivity peak           810   34.3%   30.5%        1.12  1.3 · 10^-4
DNaseI footprint                       178    7.5%    6.2%        1.22  0.005
ChIP-seq peak                          446   18.9%   15.1%        1.25  1.3 · 10^-6
RegulomeDB Score 2                      63    2.7%    2.0%        1.36  0.015
RegulomeDB Score ≤ 3                    86    3.6%    2.8%        1.32  0.012
RegulomeDB Score ≤ 4                   290   12.3%    9.5%        1.29  3.6 · 10^-5
RegulomeDB Score ≤ 5                   839   35.5%   31.6%        1.12  1.3 · 10^-4
RegulomeDB Score ≤ 6                  1326   56.1%   51.1%        1.10  3.5 · 10^-7
eQTL                                   191    8.1%    6.1%        1.33  1.0 · 10^-4
eQTL + Predicted motif                  53    2.2%    1.6%        1.44  0.015
eQTL + DNaseI hypersensitivity peak     91    3.8%    2.5%        1.57  4.7 · 10^-5
eQTL + DNaseI footprint                 20    0.8%    0.5%        1.63  0.021
eQTL + ChIP-seq peak                    52    2.2%    1.3%        1.68  1.3 · 10^-4
eQTL + RegulomeDB Score 2               10    0.4%    0.2%        2.40  0.003
eQTL + RegulomeDB Score ≤ 3             13    0.5%    0.2%        2.36  0.002
eQTL + RegulomeDB Score ≤ 4             32    1.4%    0.8%        1.63  0.002
eQTL + RegulomeDB Score ≤ 5             85    3.6%    2.3%        1.54  2.4 · 10^-4
eQTL + RegulomeDB Score ≤ 6            162    6.9%    5.2%        1.31  6.5 · 10^-4

Table 5.4: Overview of enrichment: lead SNPs only, all populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 2364
Predicted motif                        855   36.2%   37.5%        0.97  0.164
DNaseI hypersensitivity peak           942   39.8%   37.2%        1.07  0.009
DNaseI footprint                       237   10.0%    9.0%        1.12  0.063
ChIP-seq peak                          546   23.1%   20.5%        1.13  0.001
RegulomeDB Score 2                      79    3.3%    2.7%        1.22  0.086
RegulomeDB Score ≤ 3                   107    4.5%    3.9%        1.15  0.149
RegulomeDB Score ≤ 4                   342   14.5%   12.2%        1.19  0.002
RegulomeDB Score ≤ 5                   950   40.2%   36.8%        1.09  0.001
RegulomeDB Score ≤ 6                  1441   61.0%   57.0%        1.07  7.1 · 10^-5
eQTL                                   213    9.0%    7.1%        1.26  0.001
eQTL + Predicted motif                  84    3.6%    3.3%        1.09  0.460
eQTL + DNaseI hypersensitivity peak    129    5.5%    4.0%        1.38  0.001
eQTL + DNaseI footprint                 39    1.6%    1.3%        1.30  0.096
eQTL + ChIP-seq peak                    84    3.6%    2.6%        1.37  0.007
eQTL + RegulomeDB Score 2               14    0.6%    0.4%        1.63  0.068
eQTL + RegulomeDB Score ≤ 3             19    0.8%    0.5%        1.69  0.045
eQTL + RegulomeDB Score ≤ 4             44    1.9%    1.3%        1.38  0.025
eQTL + RegulomeDB Score ≤ 5            109    4.6%    3.2%        1.45  3.5 · 10^-4
eQTL + RegulomeDB Score ≤ 6            169    7.1%    5.6%        1.28  0.002

Table 5.5: Overview of enrichment: perfect LD, all populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 2364
Predicted motif                       1011   42.8%   44.3%        0.96  0.119
DNaseI hypersensitivity peak          1066   45.1%   42.3%        1.07  0.009
DNaseI footprint                       307   13.0%   11.5%        1.13  0.047
ChIP-seq peak                          642   27.2%   24.9%        1.09  0.012
RegulomeDB Score 2                      97    4.1%    3.4%        1.22  0.041
RegulomeDB Score ≤ 3                   130    5.5%    4.9%        1.13  0.147
RegulomeDB Score ≤ 4                   393   16.6%   14.3%        1.16  0.002
RegulomeDB Score ≤ 5                  1040   44.0%   40.3%        1.09  0.001
RegulomeDB Score ≤ 6                  1514   64.0%   60.4%        1.06  1.3 · 10^-4
eQTL                                   234    9.9%    8.0%        1.24  0.001
eQTL + Predicted motif                 114    4.8%    4.5%        1.07  0.482
eQTL + DNaseI hypersensitivity peak    159    6.7%    5.1%        1.32  0.001
eQTL + DNaseI footprint                 58    2.5%    2.0%        1.24  0.118
eQTL + ChIP-seq peak                   111    4.7%    3.6%        1.30  0.007
eQTL + RegulomeDB Score 2               18    0.8%    0.5%        1.63  0.033
eQTL + RegulomeDB Score ≤ 3             23    1.0%    0.6%        1.53  0.052
eQTL + RegulomeDB Score ≤ 4             57    2.4%    1.7%        1.44  0.005
eQTL + RegulomeDB Score ≤ 5            122    5.2%    3.6%        1.43  2.9 · 10^-4
eQTL + RegulomeDB Score ≤ 6            176    7.4%    5.7%        1.30  0.001

Table 5.6: Overview of enrichment: r² ≥ 0.9, all populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 2364
Predicted motif                       1144   48.4%   50.6%        0.96  0.032
DNaseI hypersensitivity peak          1171   49.5%   47.4%        1.05  0.038
DNaseI footprint                       368   15.6%   14.3%        1.09  0.077
ChIP-seq peak                          727   30.8%   29.5%        1.04  0.168
RegulomeDB Score 2                     125    5.3%    4.0%        1.33  0.001
RegulomeDB Score ≤ 3                   162    6.9%    5.8%        1.18  0.038
RegulomeDB Score ≤ 4                   449   19.0%   16.4%        1.16  0.002
RegulomeDB Score ≤ 5                  1098   46.4%   43.7%        1.06  0.005
RegulomeDB Score ≤ 6                  1562   66.1%   63.1%        1.05  0.001
eQTL                                   256   10.8%    8.7%        1.24  7.3 · 10^-4
eQTL + Predicted motif                 139    5.9%    5.7%        1.03  0.692
eQTL + DNaseI hypersensitivity peak    183    7.7%    6.2%        1.25  0.003
eQTL + DNaseI footprint                 73    3.1%    2.7%        1.13  0.300
eQTL + ChIP-seq peak                   136    5.8%    4.6%        1.24  0.014
eQTL + RegulomeDB Score 2               23    1.0%    0.6%        1.63  0.015
eQTL + RegulomeDB Score ≤ 3             29    1.2%    0.8%        1.51  0.020
eQTL + RegulomeDB Score ≤ 4             67    2.8%    2.0%        1.41  0.002
eQTL + RegulomeDB Score ≤ 5            130    5.5%    4.0%        1.37  2.6 · 10^-4
eQTL + RegulomeDB Score ≤ 6            181    7.7%    5.8%        1.32  1.7 · 10^-4

Table 5.7: Overview of enrichment: r² ≥ 0.8, all populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 2364
Predicted motif                       1467   62.1%   64.9%        0.96  0.006
DNaseI hypersensitivity peak          1461   61.8%   60.1%        1.03  0.077
DNaseI footprint                       554   23.4%   22.6%        1.04  0.283
ChIP-seq peak                          988   41.8%   41.8%        1.00  0.958
RegulomeDB Score 2                     174    7.4%    5.9%        1.24  0.004
RegulomeDB Score ≤ 3                   231    9.8%    8.7%        1.13  0.063
RegulomeDB Score ≤ 4                   573   24.2%   22.2%        1.09  0.022
RegulomeDB Score ≤ 5                  1239   52.4%   50.9%        1.03  0.112
RegulomeDB Score ≤ 6                  1623   68.7%   67.3%        1.02  0.133
eQTL                                   305   12.9%   10.8%        1.19  0.001
eQTL + Predicted motif                 223    9.4%    8.7%        1.08  0.206
eQTL + DNaseI hypersensitivity peak    258   10.9%    9.1%        1.21  0.001
eQTL + DNaseI footprint                135    5.7%    5.1%        1.11  0.162
eQTL + ChIP-seq peak                   207    8.8%    7.6%        1.15  0.030
eQTL + RegulomeDB Score 2               38    1.6%    0.9%        1.71  1.0 · 10^-4
eQTL + RegulomeDB Score ≤ 3             46    1.9%    1.3%        1.56  0.002
eQTL + RegulomeDB Score ≤ 4             94    4.0%    2.8%        1.44  8.6 · 10^-5
eQTL + RegulomeDB Score ≤ 5            147    6.2%    4.6%        1.34  6.3 · 10^-5
eQTL + RegulomeDB Score ≤ 6            181    7.7%    5.8%        1.31  6.6 · 10^-5

Table 5.8: Overview of enrichment: r² ≥ 0.5, all populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 1310
Predicted motif                        401   30.6%   28.0%        1.09  0.019
DNaseI hypersensitivity peak           474   36.2%   31.1%        1.16  1.3 · 10^-4
DNaseI footprint                       114    8.7%    6.4%        1.35  4.2 · 10^-4
ChIP-seq peak                          266   20.3%   15.6%        1.30  4.9 · 10^-7
RegulomeDB Score 2                      43    3.3%    2.0%        1.62  0.001
RegulomeDB Score ≤ 3                    55    4.2%    2.8%        1.48  0.002
RegulomeDB Score ≤ 4                   183   14.0%    9.8%        1.43  2.4 · 10^-7
RegulomeDB Score ≤ 5                   478   36.5%   32.0%        1.14  0.002
RegulomeDB Score ≤ 6                   758   57.9%   51.3%        1.13  3.3 · 10^-6
eQTL                                   107    8.2%    6.1%        1.35  0.002
eQTL + Predicted motif                  34    2.6%    1.6%        1.67  0.005
eQTL + DNaseI hypersensitivity peak     56    4.3%    2.5%        1.74  6.5 · 10^-5
eQTL + DNaseI footprint                 16    1.2%    0.5%        2.32  2.5 · 10^-4
eQTL + ChIP-seq peak                    33    2.5%    1.4%        1.83  1.4 · 10^-4
eQTL + RegulomeDB Score 2                8    0.6%    0.2%        3.56  1.2 · 10^-4
eQTL + RegulomeDB Score ≤ 3             10    0.8%    0.2%        3.42  1.1 · 10^-4
eQTL + RegulomeDB Score ≤ 4             21    1.6%    0.8%        1.89  0.001
eQTL + RegulomeDB Score ≤ 5             47    3.6%    2.4%        1.52  0.007
eQTL + RegulomeDB Score ≤ 6             87    6.6%    5.2%        1.28  0.022

Table 5.9: Overview of enrichment: lead SNPs only, European populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 1310
Predicted motif                        775   59.2%   62.6%        0.95  0.011
DNaseI hypersensitivity peak           774   59.1%   58.8%        1.00  0.850
DNaseI footprint                       302   23.1%   22.3%        1.03  0.526
ChIP-seq peak                          553   42.2%   40.8%        1.04  0.233
RegulomeDB Score 2                     105    8.0%    6.0%        1.34  0.002
RegulomeDB Score ≤ 3                   140   10.7%    8.7%        1.23  0.011
RegulomeDB Score ≤ 4                   337   25.7%   21.6%        1.19  1.1 · 10^-4
RegulomeDB Score ≤ 5                   670   51.1%   49.5%        1.03  0.175
RegulomeDB Score ≤ 6                   892   68.1%   65.7%        1.04  0.044
eQTL                                   150   11.5%   10.1%        1.14  0.120
eQTL + Predicted motif                 102    7.8%    7.8%        1.00  0.968
eQTL + DNaseI hypersensitivity peak    119    9.1%    8.1%        1.12  0.212
eQTL + DNaseI footprint                 64    4.9%    4.4%        1.11  0.385
eQTL + ChIP-seq peak                    88    6.7%    6.8%        0.99  0.952
eQTL + RegulomeDB Score 2               14    1.1%    0.8%        1.31  0.288
eQTL + RegulomeDB Score ≤ 3             20    1.5%    1.1%        1.37  0.129
eQTL + RegulomeDB Score ≤ 4             40    3.1%    2.5%        1.24  0.159
eQTL + RegulomeDB Score ≤ 5             68    5.2%    4.3%        1.21  0.125
eQTL + RegulomeDB Score ≤ 6             90    6.9%    5.6%        1.23  0.061

Table 5.10: Overview of enrichment: perfect LD, European populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 1310
Predicted motif                        888   67.8%   71.1%        0.95  0.013
DNaseI hypersensitivity peak           884   67.5%   66.7%        1.01  0.512
DNaseI footprint                       377   28.8%   28.8%        1.00  0.988
ChIP-seq peak                          668   51.0%   49.1%        1.04  0.134
RegulomeDB Score 2                     120    9.2%    7.3%        1.25  0.014
RegulomeDB Score ≤ 3                   163   12.4%   10.7%        1.16  0.045
RegulomeDB Score ≤ 4                   389   29.7%   25.3%        1.18  1.7 · 10^-4
RegulomeDB Score ≤ 5                   717   54.7%   53.1%        1.03  0.134
RegulomeDB Score ≤ 6                   900   68.7%   67.1%        1.02  0.145
eQTL                                   173   13.2%   11.7%        1.13  0.096
eQTL + Predicted motif                 135   10.3%    9.9%        1.04  0.670
eQTL + DNaseI hypersensitivity peak    151   11.5%   10.2%        1.13  0.138
eQTL + DNaseI footprint                 89    6.8%    6.4%        1.07  0.480
eQTL + ChIP-seq peak                   123    9.4%    8.9%        1.05  0.571
eQTL + RegulomeDB Score 2               14    1.1%    1.0%        1.03  0.916
eQTL + RegulomeDB Score ≤ 3             22    1.7%    1.4%        1.20  0.369
eQTL + RegulomeDB Score ≤ 4             47    3.6%    2.9%        1.24  0.141
eQTL + RegulomeDB Score ≤ 5             73    5.6%    4.6%        1.21  0.106
eQTL + RegulomeDB Score ≤ 6             89    6.8%    5.5%        1.23  0.050

Table 5.11: Overview of enrichment: r² ≥ 0.9, European populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 1310
Predicted motif                        985   75.2%   77.6%        0.97  0.044
DNaseI hypersensitivity peak           962   73.4%   73.3%        1.00  0.931
DNaseI footprint                       475   36.3%   34.9%        1.04  0.297
ChIP-seq peak                          772   58.9%   56.7%        1.04  0.107
RegulomeDB Score 2                     144   11.0%    8.7%        1.27  0.004
RegulomeDB Score ≤ 3                   196   15.0%   12.6%        1.19  0.011
RegulomeDB Score ≤ 4                   433   33.1%   28.3%        1.17  1.2 · 10^-4
RegulomeDB Score ≤ 5                   733   56.0%   55.1%        1.02  0.417
RegulomeDB Score ≤ 6                   889   67.9%   66.8%        1.02  0.291
eQTL                                   206   15.7%   13.2%        1.19  0.004
eQTL + Predicted motif                 170   13.0%   11.9%        1.09  0.188
eQTL + DNaseI hypersensitivity peak    186   14.2%   12.1%        1.18  0.014
eQTL + DNaseI footprint                125    9.5%    8.2%        1.16  0.040
eQTL + ChIP-seq peak                   162   12.4%   10.9%        1.13  0.084
eQTL + RegulomeDB Score 2               17    1.3%    1.2%        1.08  0.730
eQTL + RegulomeDB Score ≤ 3             27    2.1%    1.6%        1.27  0.172
eQTL + RegulomeDB Score ≤ 4             53    4.0%    3.2%        1.28  0.042
eQTL + RegulomeDB Score ≤ 5             71    5.4%    4.7%        1.15  0.199
eQTL + RegulomeDB Score ≤ 6             84    6.4%    5.4%        1.19  0.069

Table 5.12: Overview of enrichment: r² ≥ 0.8, European populations.

                                     Observed         Exp.  Enrichment  P-value
Total                                 1310
Predicted motif                       1179   90.0%   90.3%        1.00  0.755
DNaseI hypersensitivity peak          1154   88.1%   87.0%        1.01  0.207
DNaseI footprint                       736   56.2%   52.4%        1.07  0.001
ChIP-seq peak                         1025   78.2%   74.5%        1.05  0.002
RegulomeDB Score 2                     199   15.2%   12.3%        1.23  0.002
RegulomeDB Score ≤ 3                   262   20.0%   17.6%        1.14  0.024
RegulomeDB Score ≤ 4                   490   37.4%   35.0%        1.07  0.075
RegulomeDB Score ≤ 5                   734   56.0%   56.2%        1.00  0.863
RegulomeDB Score ≤ 6                   807   61.6%   62.4%        0.99  0.490
eQTL                                   304   23.2%   18.5%        1.25  1.7 · 10^-6
eQTL + Predicted motif                 292   22.3%   18.0%        1.24  1.2 · 10^-5
eQTL + DNaseI hypersensitivity peak    300   22.9%   18.1%        1.27  7.1 · 10^-7
eQTL + DNaseI footprint                250   19.1%   14.5%        1.31  1.3 · 10^-7
eQTL + ChIP-seq peak                   287   21.9%   17.2%        1.27  8.6 · 10^-7
eQTL + RegulomeDB Score 2               36    2.7%    1.9%        1.46  0.027
eQTL + RegulomeDB Score ≤ 3             49    3.7%    2.5%        1.52  0.004
eQTL + RegulomeDB Score ≤ 4             74    5.6%    4.2%        1.34  0.010
eQTL + RegulomeDB Score ≤ 5             88    6.7%    5.4%        1.25  0.030
eQTL + RegulomeDB Score ≤ 6             91    6.9%    5.6%        1.23  0.046

Table 5.13: Overview of enrichment: r² ≥ 0.5, European populations.

5.2.6 Analysis at the phenotype level

In addition to considering individual associations separately, we can group associated SNPs in order to search for patterns at the phenotype level. We first assessed whether there are specific sequence binding proteins that tend to overlap functional SNPs associated with certain phenotypes more often than expected, using only associations in populations of European descent (Figure 5.5). We found a strong association (P-value 9 · 10^-5) between height and CTCF ChIP-seq peaks. A total of 39 SNPs associated with height overlap a ChIP-seq peak or are in strong linkage disequilibrium (r² ≥ 0.8 in the CEU population) with a SNP that overlaps a ChIP-seq peak, and 15 of those (38%) overlap a peak for CTCF (Table 5.14), compared to 89 out of 626 SNPs (14%) when considering all phenotypes. We also found an interesting interaction between prostate cancer and the androgen receptor (AR), a transcription factor that was not assessed by ENCODE but was assessed as a control in a separate study [124]. Of the 9 functional SNPs for prostate cancer that overlap a ChIP-seq peak, 5 overlap an AR ChIP-seq peak (Table 5.15).
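One way to quantify how unusual the CTCF overlap for height is, given the counts above, is a one-sided hypergeometric test. The sketch below is an illustration under stated assumptions and not necessarily the test used to obtain the reported P-value: it assumes that the 626 SNPs and 89 CTCF overlaps across all phenotypes include the 39 height SNPs and their 15 CTCF overlaps, and the function name is hypothetical.

```python
from math import comb


def hypergeom_upper_tail(k, n, K, N):
    """P(X >= k) when n items are drawn without replacement from N items,
    K of which are successes (here: SNPs overlapping a CTCF peak)."""
    total = comb(N, n)
    upper = min(n, K)
    favourable = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return favourable / total


# Counts taken from the text; their interpretation is an assumption (see above).
p_height_ctcf = hypergeom_upper_tail(k=15, n=39, K=89, N=626)
print(f"P(at least 15 of 39 height SNPs overlap CTCF) = {p_height_ctcf:.2e}")
```

Under these assumptions the expected number of CTCF overlaps amongst the 39 height SNPs is about 5.5 (39 × 89 / 626), compared to the 15 that are observed.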

[Figure 5.5: matrix of overlap counts, with one row per phenotype and one column per DNA binding protein.]

Figure 5.5: Phenotype level overview of the overlap between associations and ChIP-seq binding. This matrix view shows phenotypes vertically and DNA binding proteins assessed using ChIP-seq horizontally. Each cell represents the number of lead SNPs for the respective phenotype that overlap with a ChIP-seq peak for the respective DNA binding protein or are in strong LD (r² ≥ 0.8 in the CEU HapMap 2 population) with a SNP that overlaps such a peak. Only phenotypes with at least 20 lead SNPs, and DNA binding proteins overlapping at least 30 functional SNPs are shown, but totals are computed over the entire data set. The significant interaction between height-associated functional SNPs and CTCF, as well as the association between prostate cancer-associated functional SNPs and androgen receptor (AR) are represented in bold font.

       Associated SNP  Functional SNP  Gene              Study                           P-value
chr1   rs6686842       rs11209342      SCMH1             Weedon et al. 2008 [125]        2 · 10^-8
chr2   rs2580816                       NPPC, DIS3L2      Lango Allen et al. 2010 [92]    6 · 10^-22
chr2   rs6724465       rs13419740      NHEJ1             Weedon et al. 2008 [125]        2 · 10^-8
chr3   rs9863706                       RYBP              Lango Allen et al. 2010 [92]    4 · 10^-13
chr6   rs6899976                       L3MBTL3           Gudbjartsson et al. 2008 [126]  6 · 10^-6
chr6   rs7742369                       KRT18P9, CYCSL1   Okada et al. 2010 [127]         1 · 10^-13
chr7   rs2730245       rs6965685       WDR60             Lettre et al. 2008 [128]        3 · 10^-7
chr9   rs7032940       rs7036157       AKAP2, C9orf152   Kim et al. 2010 [129]           3 · 10^-6
chr9   rs946053                        COL27A1           Gudbjartsson et al. 2008 [126]  2 · 10^-7
chr14  rs1950500       rs12590407      RIPK3, NFATC4     Lango Allen et al. 2010 [92]    2 · 10^-18
chr15  rs8041863       rs8030631       ACAN              Weedon et al. 2008 [125]        8 · 10^-8
chr17  rs3760318                       ADAP2             Gudbjartsson et al. 2008 [126]  2 · 10^-9
chr17  rs2665838                       GH2, CSH1         Lango Allen et al. 2010 [92]    5 · 10^-25
chr19  rs12986413      rs1015670       DOT1L             Lettre et al. 2008 [128]        3 · 10^-8
chr2   rs5751614                       BCR               Gudbjartsson et al. 2008 [126]  6 · 10^-6

Table 5.14: Height-associated functional SNPs overlapping CTCF binding sites

An empty cell in the Functional SNP column indicates that the associated lead SNP is also functional.

       Associated SNP  Functional SNP  Gene              Study                           P-value
chr4   rs7679673       rs10007915      RPL6P14 - TET2    Eeles et al. 2009 [130]         3 · 10^-14
chr6   rs339331                        RFX6              Takata et al. 2010 [131]        2 · 10^-12
chr7   rs10486567                      JAZF1             Thomas et al. 2008 [132]        2 · 10^-6
chr8   rs1456315                       SRRM1P1,POU5F1B   Takata et al. 2010 [131]        2 · 10^-29
chr17  rs1859962                       CALM2P1,SOX9      Schumacher et al. 2011 [133]    3 · 10^-11
                                                         Eeles et al. 2009 [130]         2 · 10^-16
                                                         Eeles et al. 2008 [134]         1 · 10^-6
                                                         Gudmundsson et al. 2007 [135]   3 · 10^-10

Table 5.15: Prostate cancer-associated functional SNPs overlapping AR binding sites

An empty cell in the Functional SNP column indicates that the associated lead SNP is also functional.

5.3 Discussion

In this work, we used data generated by the ENCODE consortium to identify regulatory and transcribed functional SNPs that are associated with a phenotype, either directly in a genome wide association study or indirectly through linkage disequilibrium with a GWAS association. We further added eQTL information, thus identifying SNPs that are associated with a phenotype, for which there is evidence that they affect a regulatory region or a transcribed region, and for which a downstream target affected by the SNP is known. This approach therefore has the potential to provide putative mechanistic explanations for GWAS associations. We showed that this method is successful in identifying a functional SNP for a majority of previously reported GWAS associations (up to 81% when considering association studies performed in populations of European descent, and using the CEU population to obtain linkage disequilibrium information).
The fraction of associated SNPs for which we can provide a functional annotation is similar to the one reported in the ENCODE integrative analysis paper [119]. The integrative analysis uses both DNase-seq and Formaldehyde-Assisted Isolation of Regulatory Elements (FAIRE) [136] data to identify regions of open chromatin, and thus finds a slightly larger fraction of the associated SNPs to overlap or be in LD with open chromatin regions compared to our approach, which does not use FAIRE data.
We found that GWAS associations are significantly enriched for DNase hypersensitivity peaks, DNaseI footprints and ChIP-seq peaks even when accounting for most features of associated SNPs. Our results are consistent with chromatin state-based methods [115], in which a segmentation approach was used in order to identify enrichment for disease associations in predicted enhancers. Segmentation-based approaches use machine learning methods to predict chromatin state at every position in the genome based mostly on histone information. These predictions are then compared to GWAS results, thus showing enrichment for predicted states. A major difference of our work is that we directly used ChIP-seq and DNase-seq functional data in our analysis, and show enrichment for observed ChIP-seq peaks or DNaseI hypersensitive regions. In this work, we demonstrated that there is significant enrichment of GWAS associations for these types of data. Furthermore, we found that 1) integrating multiple types of functional data and expression information identifies more likely candidate causal SNPs within an LD region, and 2) phenotypic information from GWAS can be associated with biochemical data.
Existing methods for prioritizing SNPs based on their functional role focused on transcribed regions [94, 95, 96], whereas we focused on regulatory regions. In the context of regulatory regions, most approaches are based on motif information [101, 102], and approaches using experimental data have generally been limited to individual associations [114]. The comprehensive data sets generated by the ENCODE consortium are the first to offer sufficient information to allow for genome-wide methods that rely on experimental information.
We used enrichment to compare the sensitivity of our approach with motif-based methods. We found that there is no significant enrichment for GWAS associations amongst conserved motifs.

5.3.1 Identifying functional SNPs in linkage disequilibrium with lead SNPs

We found that in most cases, there is more evidence supporting another SNP in strong LD with the lead SNP than the lead SNP itself. This is consistent with results from fine mapping analyses that indicate that multiple variants in the linkage disequilibrium region surrounding a lead SNP appear to play a role in the phenotype of interest [137, 93]. This result is of particular importance for the interpretation of GWAS results, as LD patterns differ markedly between populations. If the functional SNP is in strong LD with the lead SNP in the population in which the GWAS was performed, but not in a different population, then the lead SNP will not be associated with the phenotype in this second population. An example of this situation is functional SNP rs1333047, which lies in a region associated with coronary artery disease. This SNP is in perfect LD with two lead SNPs in populations of European descent in which the studies identifying the associations were performed, but not in populations of African descent. This example is discussed in detail in Section 6.4.
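To make the dependence of LD on the population concrete, the sketch below computes the standard r² statistic between two biallelic SNPs from phased haplotype counts. The haplotype counts for the two populations are invented for illustration and do not correspond to rs1333047 or to any HapMap data; the function name is hypothetical.

```python
def r_squared(n_AB, n_Ab, n_aB, n_ab):
    """r^2 between two biallelic SNPs from the counts of the four haplotypes.

    n_AB: haplotypes carrying allele A at SNP 1 and allele B at SNP 2, etc.
    """
    total = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / total
    p_A = (n_AB + n_Ab) / total   # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / total   # frequency of allele B at SNP 2
    d = p_AB - p_A * p_B          # linkage disequilibrium coefficient D
    denominator = p_A * (1 - p_A) * p_B * (1 - p_B)
    if denominator == 0:
        raise ValueError("r^2 is undefined when one of the SNPs is monomorphic")
    return d * d / denominator


# Invented haplotype counts for the same pair of SNPs in two populations.
r2_population_1 = r_squared(60, 0, 0, 40)     # alleles always co-occur
r2_population_2 = r_squared(30, 30, 20, 20)   # alleles are independent
print(f"population 1: r^2 = {r2_population_1:.2f}")
print(f"population 2: r^2 = {r2_population_2:.2f}")
```

In the first invented population the lead SNP would be a perfect proxy for the functional SNP, whereas in the second it would carry no information about it, which is the situation described above for associations that do not transfer between populations.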

5.3.2 Comparison of functional assays

We integrated data from multiple types of functional assays in order to identify functional SNPs. We found that the highest enrichments are obtained when requiring functional SNPs to be supported by multiple sources of experimental evidence rather than only one. The highest enrichments are observed when using both eQTL information and ENCODE data, and when considering associations that have been replicated. A similar trend can be observed when examining individual assays. The more specific the assay, the higher the enrichment for overlap amongst GWAS associations: DNase hypersensitivity peaks, which broadly capture regions in which chromatin is accessible, overlap with a large fraction of SNPs in general, thus leading to relatively weak enrichments, whereas the enrichment is much higher for ChIP-seq peaks, which experimentally identify the binding of specific transcription factors and other molecules. There is a clear trade-off between the more significant enrichment we observe, and the lower fraction of associations annotated with ChIP-seq peaks. The ChIP-seq data generated so far by the ENCODE consortium only assesses 119 transcription factors, a fraction of the 1,800 known ones [119]. Most transcription factors are assessed in a small subset of the ENCODE cell lines, whereas DNase-seq has been performed on most ENCODE cell lines. DNase footprinting, which combines DNase-seq data with sequence and motif information, is useful to identify potential binding sites for transcription factors not assessed using ChIP-seq. An example of this situation is functional SNP rs7163757, which is in LD with a lead SNP associated with type 2 diabetes. DNaseI footprinting identifies a nuclear factor of activated T-cells (NFAT) footprint that overlaps rs7163757. NFAT is part of the calcineurin/NFAT pathway [138], which has been implicated in the regulation of growth and function of the insulin-producing pancreatic beta cells, and linked to the expression of genes known to be associated with type 2 diabetes [139].

5.3.3 Differences between tissue types

Transcription factor binding patterns are heterogeneous, and differ between tissue types. Assessing this heterogeneity has been a main motivation for the ENCODE project. One concern is that the cell lines from which the functional information is derived do not necessarily correspond to the tissue type that is most relevant to the phenotype of interest. A similar approach has, however, been successfully used to identify functional SNPs that play a role in coronary artery disease based on a ChIP-seq assay performed in the immortalized HeLa cell line [114]. By choosing to use functional data across all tissues, we purposefully favor sensitivity over specificity. An example illustrating the benefits of this trade-off is rs2074238, a functional SNP associated with long QT syndrome. A ChIP-seq experiment identifies the binding of estrogen receptor alpha at this location in an epithelial cell line. Long QT syndrome is more prevalent in women [140, 141], the menstrual cycle affects the QT interval [142], and estrogen therapy has been shown to affect the duration of the QT interval in postmenopausal women [143, 144]. ChIP-seq data for this transcription factor is only available for two cell lines, neither of cardiac origin. By limiting our approach to functional data obtained in cardiac tissues, we would have excluded a transcription factor whose role in the phenotype is supported by extensive prior evidence. When examining all associations, the significant enrichments we report demonstrate that our current approach improves specificity compared to using motif information only. Although the ChIP-seq data generated so far by the ENCODE consortium is sparse, especially in terms of the number of different tissues in which a transcription factor is assessed, the number of available data sets is growing rapidly. We expect that it will soon become possible to refine this approach by considering the most relevant tissue types only, thus further improving its specificity. A remaining challenge is the identification of specific tissue types that are relevant for a given phenotype. A specific example is a functional SNP we identify in the context of Alzheimer's disease: in cell lines of hepatic origin, rs3764650 overlaps a binding site for HNF4A, a transcription factor known to mainly play a role in the liver. Although Alzheimer's is a neurodegenerative disease, a recently published study shows that the liver might play an important role in the disease mechanism as well [145]. This example shows the benefits of looking broadly at all available experimental data from ENCODE.

5.3.4 Functional SNPs beyond reported associations

In this work, we focused on using ENCODE information in order to identify functional SNPs in strong LD with previously reported associations. It is however important to note that these SNPs only represent a small fraction of all the SNPs that overlap functional regions identified by ENCODE. SNPs that alter transcription factor binding sites are likely to have some biologically important effect, and have an impact on some phenotype. Such a SNP will, however, only be found in a GWAS if the specific phenotype it affects is assessed. Given this fundamental limitation of association studies, an orthogonal approach would be to study the functional effects of common SNPs regardless of their association with a phenotype. Furthermore, this effect explains why the enrichments we observe, while significant, are relatively modest. We used a stringent null model in which a lead SNP is matched to a random SNP that is similar to the lead SNP, and in particular located at a similar distance to the nearest transcription start site. Associated SNPs are located closer to genes than SNPs in general, and therefore null sets are also biased towards SNPs that are likely to have some biological effect. Relaxing the null model leads to higher enrichments (Figure 5.4B,C).

5.3.5 Analysis at the phenotype level

We identify a significant association between height and CTCF, with 15 associated SNPs either overlapping a CTCF ChIP-seq peak, or in strong linkage disequilibrium with a CTCF peak. CTCF [146] is a transcription factor that plays a key role in insulators in the human genome [147, 148]. While CTCF plays a role in many biological processes [149], the association with height that we observe is significant when compared to the fraction of other GWAS associations overlapping CTCF. Methylation of a CTCF site has been shown to control the expression of IGF2 [147], a gene involved in embryonic growth regulation [150] that also plays a major role in muscle growth in pigs [151]. It is possible that disruptions of CTCF binding sites could inactivate insulators, either due to the genotype at loci in the binding sites, or due to the joint effect of genotype and methylation, and that these disruptions could have an effect on tissue growth, and thus height. We identify five functional SNPs that are associated with prostate cancer and overlap androgen receptor (AR) binding sites identified using ChIP-seq. The androgen receptor is activated by testosterone [152] and has been shown to play an important role in prostate cancer progression [153, 154], and therapy [155]. Our results therefore indicate that these associations might be related to a known mechanism involved in prostate cancer.

5.4 Conclusion

We showed that genome wide experimental data sets generated by the ENCODE consortium can be successfully used to provide putative functional annotations for the majority of the GWAS associations reported in the literature. The use of these experimental assays outperforms the use of in-silico binding predictions based on sequence motifs when trying to identify functional SNPs associated with a phenotype in a GWAS. We demonstrate that an integrative approach combining genome wide association studies, gene expression analysis and experimental evidence of regulatory activity leads to the identification of loci that are involved in common diseases, and generates hypotheses about the biological mechanism underlying the association. In the majority of cases, the SNP most likely to play a functional role according to ENCODE evidence is not the reported association, but a different SNP in strong linkage disequilibrium with the reported association. Our approach, which builds directly on the publicly available RegulomeDB database, provides a simple framework that can be applied to the functional analysis of any genome wide association study.

5.5 Data

5.5.1 GWAS catalog

We use the NHGRI GWAS catalog [11] (downloaded on August 10, 2011) to obtain a list of GWAS associations. The version of the catalog we use contains 5922 entries. Each entry associates a SNP with a phenotype in a study. Most studies report several associated SNPs and each of them has a separate entry. A SNP associated with two phenotypes in the same study results in two entries. The GWAS catalog provides additional information for each entry. We use information about the genotyping technology used (Affymetrix, Illumina or Perlegen), whether the SNP was directly genotyped or imputed, the statistical support for the association (P-value), the population(s) in which the study was performed and the presence or absence of a replication cohort. We map each entry in the GWAS catalog to dbSNP 131, which results in a total of 5694 associations on the autosomes and on chromosome X. We exclude the single association on chromosome Y from our analysis. The 5694 associations represent a total of 4724 distinct SNPs, 470 phenotypes and 810 studies. 416 SNPs are associated in more than one study.

5.5.2 Linkage disequilibrium

We use HapMap version 2 [7] and version 3 [67] in order to obtain linkage disequilibrium information between SNPs. HapMap 2 provides a higher SNP density, whereas HapMap 3 provides information for more populations. The HapMap 2 data we use includes 2,776,528 SNPs in the CEPH (Utah residents with ancestry from northern and western Europe, abbreviation: CEU) population, 2,554,939 SNPs in the Han Chinese in Beijing, China (abbreviation: CHB) population, 3,114,362 SNPs in the Yoruba in Ibadan, Nigeria (abbreviation: YRI) population, and 2,509,881 SNPs in the Japanese in Tokyo, Japan (abbreviation: JPT) population. We create an intersection set that includes all 2,135,736 SNPs that are assessed in all four HapMap 2 populations. We use HapMap 2 and HapMap 3 data in order to create a list of pairs of SNPs for which there is some evidence of LD (r² ≥ 0.1) in any HapMap 2 or HapMap 3 population.

5.5.3 Genotyping arrays

We download the list of SNPs that appear on genotyping arrays from the SNP Genotyping Array track of the UCSC genome browser [30] and use all SNPs that also appear in dbSNP 132. Arrays include Affymetrix SNP 6.0 (905,283 SNPs), Affymetrix SNP 5.0 (435,360 SNPs), Affymetrix GeneChip Human Mapping 250K Nsp (257,159 SNPs), Affymetrix GeneChip Human Mapping 250K Sty (233,887 SNPs), Illumina Human Hap 650v3 (660,388 SNPs), Illumina Human Hap 550v3 (560,972 SNPs), Illumina Human Hap 300v3 (318,046 SNPs), Illumina Human1M-Duo (1,146,891 SNPs), Illumina Human CytoSNP-12 (299,358 SNPs), Illumina Human 660W-Quad (593,197 SNPs), and Illumina Human Omni1-Quad (972,372 SNPs). This track does not provide information for the Perlegen arrays. We compute combined lists of all SNPs that appear on any Affymetrix array, and of all SNPs that appear on any Illumina array, remove SNPs that do not appear in all HapMap 2 populations, and remove SNPs on chromosome Y. We obtain a total of 1,006,273 SNPs for the Illumina arrays, and 920,693 SNPs for the Affymetrix arrays.
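The construction of the per-vendor SNP lists can be sketched with plain set operations. This is a minimal illustration under stated assumptions, not the code used for the thesis: the function name, the dictionary layout and the toy identifiers are hypothetical.

```python
def platform_snp_list(arrays, hapmap2_all_pops, chrom_of):
    """Combine the arrays of one vendor into a single filtered SNP list.

    arrays:           dict mapping array name -> set of SNP ids (hypothetical layout)
    hapmap2_all_pops: set of SNP ids assessed in all four HapMap 2 populations
    chrom_of:         dict mapping SNP id -> chromosome name
    """
    combined = set().union(*arrays.values())  # SNPs present on any array
    return {
        snp for snp in combined
        if snp in hapmap2_all_pops and chrom_of.get(snp) != "chrY"
    }


# Toy data with made-up identifiers.
affy_arrays = {"SNP 6.0": {"rs1", "rs2", "rs3"}, "SNP 5.0": {"rs2", "rs4"}}
hapmap2_all_pops = {"rs1", "rs2", "rs4"}
chrom_of = {"rs1": "chr1", "rs2": "chr7", "rs3": "chr2", "rs4": "chrY"}
affy_snps = platform_snp_list(affy_arrays, hapmap2_all_pops, chrom_of)
```

In this toy example rs3 is dropped because it is missing from the HapMap 2 intersection set, and rs4 because it lies on chromosome Y.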

5.5.4 SNP properties

We use the function information generated by the UCSC genome browser for each SNP in dbSNP 132 [156]. The functional role is predicted based on UCSC genes. Functional classes are: near the 3' end of a gene (within 500 bases of a transcript), near the 5' end of a gene (within 2 kb of a transcript), coding synonymous, coding non-synonymous (nonsense, missense, frameshift, coding indel or coding unknown), 3' or 5' untranslated regions, introns, splice sites or unknown (intergenic regions). These classifications do not use GENCODE v7 information, or any regulatory information from ENCODE. We use the UCSC Genes track in order to find the closest transcription start site (TSS) for each SNP, and the dbSNP allele frequency information to determine the minor allele frequency of each SNP. We use a total of 26,561,892 SNPs from dbSNP 132.

5.5.5 Functional annotations

We use the November 7, 2011 version of RegulomeDB (http://RegulomeDB.org) in order to annotate SNPs with regulatory information. RegulomeDB integrates various types of assays, including DNaseI-seq peaks and ChIP-seq peaks, both generated by the ENCODE consortium, DNaseI footprints, conserved motifs, eQTLs curated from several studies in multiple tissues, and validated functional loci. For each SNP, RegulomeDB provides a list of datasets in which there is evidence of function in a region overlapping the SNP, as well as a score indicating the confidence that the SNP is functional based on all the available evidence for the locus. The RegulomeDB ENCODE companion paper describes the database and scoring metric in more detail [123]. We run RegulomeDB on every SNP in dbSNP 132. There is some evidence of function for a total of 13,453,666 SNPs (50.7%).

In this study, we use a slightly modified format of the RegulomeDB scores in order to integrate linkage disequilibrium and eQTL information. A RegulomeDB score of 1a through 1f indicates that a SNP is an eQTL, with the letter indicating how much other functional evidence there is for the SNP. Each letter maps to another score (2a through 5), with the only difference being that the higher scores denote SNPs that are not eQTLs. We map scores between 1a and 1f back to the corresponding scores between 2a and 5, and handle eQTLs separately. We also create two additional scores, 5a for SNPs overlapping ChIP-seq peaks only, and 5b for SNPs overlapping DNaseI-seq peaks only, whereas RegulomeDB uses a score of 5 for both. Table 5.16 provides an overview of the modified scoring scheme.

Score  Evidence supporting the functional SNP in RegulomeDB
2a     ChIP-seq peak, matched DNaseI footprint, matched motif, DNaseI peak
2b     ChIP-seq peak, DNaseI footprint and peak, and motif
2c     ChIP-seq peak, matched motif and DNaseI peak
3a     ChIP-seq peak, DNaseI peak and motif
3b     ChIP-seq peak and matched motif
4      ChIP-seq and DNaseI peak
5a     ChIP-seq peak only
5b     DNaseI peak only
6      Motif only
7      No annotation

Table 5.16: Modified RegulomeDB scoring scheme. Lower scores indicate more evidence for the SNP to be in a regulatory region.

5.5.6 Transcribed regions

We use GENCODE v7 [157] to identify SNPs that overlap transcribed regions. We intersect the GENCODE v7 Genes basic set track from the UCSC browser with all SNPs in dbSNP 132 and determine whether the SNP lies in an exon. If the SNP is in an exon, we use the coding region start and end information of the browser table in order to determine whether the SNP is in a coding region or is in a non-coding part of an exon. We treat introns in the same way as intergenic regions, since regulatory elements can be found in both. This leads to a transcriptional annotation for each SNP.
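The exon and coding-region test can be illustrated for a single transcript as follows. This is a sketch under assumptions: the function name and class labels are hypothetical, and coordinates are taken to be 0-based and half-open, as in UCSC browser tables. How several overlapping transcripts are combined is not covered here.

```python
def transcriptional_annotation(pos, exons, cds_start, cds_end):
    """Classify a SNP position against one transcript.

    pos:       0-based SNP position (assumed coordinate convention)
    exons:     list of (start, end) half-open exon intervals
    cds_start, cds_end: half-open coding region of the transcript;
               equal values mean the transcript has no coding region
    """
    in_exon = any(start <= pos < end for start, end in exons)
    if not in_exon:
        # Introns are handled like intergenic regions.
        return "non-exonic"
    if cds_start <= pos < cds_end:
        return "coding"
    return "non-coding exon"


# Toy transcript: two exons, coding region from 150 to 350.
exons = [(100, 200), (300, 400)]
classes = [transcriptional_annotation(p, exons, 150, 350) for p in (120, 160, 250, 360)]
```

For the toy transcript the four positions fall, in order, in a 5' untranslated exon segment, the coding region, the intron, and a 3' untranslated exon segment.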

5.6 Methods

5.6.1 Lead SNP annotation

We call the associated SNP reported in a GWAS the lead SNP. For each lead SNP we retrieve the regulatory annotation from RegulomeDB, and the transcriptional annotation from GENCODE v7. We determine the fraction of lead SNPs that are coding, in non-coding parts of exons, that overlap DNaseI peaks, DNaseI footprints and ChIP-seq peaks independently of each other. This means that if, for example, a SNP overlaps both a DNase peak and a ChIP-seq peak, then it will be counted for both types of assays. We consider that there is an overlap between the SNP and the type of assay if there is at least one ENCODE cell line in which there is respectively a DNase peak, a DNase footprint for at least one motif, or a ChIP-seq peak for at least one binding protein that overlaps the SNP. In order to determine a score for lead SNPs, we first assess whether the SNP is in an exon. If the SNP is not in an exon, then we assign the modified RegulomeDB score to this SNP. We use Fisher's exact test on a 2x2 table to compute a P-value for the difference in the fraction of functionally annotated SNPs between all lead SNPs and lead SNPs that are eQTLs.
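For reference, Fisher's exact test on a 2x2 table can be computed directly from the hypergeometric distribution. The sketch below uses only the standard library and assumes the common two-sided definition, which sums the probabilities of all tables that are no more likely than the observed one. The text does not state which variant or software was used, and the counts in the example are illustrative only.

```python
from math import comb


def fisher_exact_two_sided(a, b, c, d):
    """Two-sided Fisher's exact test for the table [[a, b], [c, d]].

    Rows could be two groups of SNPs and columns 'annotated' and
    'not annotated'; this layout is an assumption for illustration.
    """
    row1, col1, n = a + b, a + c, a + b + c + d
    lo, hi = max(0, row1 + col1 - n), min(row1, col1)

    def weight(x):
        # Unnormalized hypergeometric probability of x in the top-left cell.
        return comb(col1, x) * comb(n - col1, row1 - x)

    observed = weight(a)
    # Exact integer comparison, so no floating-point tolerance is needed.
    extreme = sum(weight(x) for x in range(lo, hi + 1) if weight(x) <= observed)
    return extreme / comb(n, row1)


p_value = fisher_exact_two_sided(8, 2, 1, 5)  # illustrative counts only
```

Working with integer weights keeps the comparison between tables exact, at the cost of large intermediate integers for tables with many thousands of SNPs.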

5.6.2 Linkage disequilibrium integration

For each lead SNP we compute the set of all SNPs in LD with that lead SNP. We first use an r² threshold in order to limit the LD set to SNPs in strong LD with the lead SNP. In order to add a SNP to the LD set, we require that the r² is above the threshold in all four HapMap 2 populations. We then look separately at associations found in populations of European descent. For each of these lead SNPs, we obtain a set of SNPs in LD with the lead SNP when considering the HapMap 2 CEU population only. We separately analyze the set of all lead SNPs, and the subset of European-descent lead SNPs. In order to compute the fraction of SNPs in LD with a lead SNP that overlap a type of functional data, we count every lead SNP at most once, namely when one or more SNPs in the LD set overlap with the functional data type. In order to compute a score, we find the best candidate in the LD set corresponding to each lead SNP. We consider that a coding SNP has more functional evidence than a SNP in a non-coding part of an exon, and that a SNP in an exon has more functional evidence than a regulatory SNP. If no SNP in the LD set is transcribed, then we find the SNP with the best RegulomeDB score. We consider an associated region to be an eQTL if there is at least one eQTL in the set of SNPs in LD with the lead SNP.

The use of linkage disequilibrium to identify functional SNPs is based on the assumption that the linkage disequilibrium structure used in this analysis is the same as in the population in which the association study was performed. This assumption is necessary in order to consider that a functional SNP in strong linkage disequilibrium with a lead SNP is associated with the phenotype. If the two SNPs are not correlated in the actual study population, then there is no evidence that the functional SNP has an effect on the phenotype. Linkage disequilibrium patterns differ significantly between populations, and it is therefore challenging to obtain linkage disequilibrium information that closely matches the population in which the GWAS was performed. We choose to use a conservative approach in which we consider two SNPs to be in strong linkage disequilibrium only if they are in strong linkage disequilibrium (r² ≥ 0.8) in all four HapMap 2 populations. These populations are of European, African, Japanese and Chinese origin, and thus encompass a large part of the variation between populations, and in particular represent populations that diverged early on in recent human evolution. Linkage disequilibrium patterns that are conserved across all four populations are likely to be conserved in the populations studied in GWAS as well, and this approach should thus reduce the number of false positives amongst the functional SNPs we identify. This clearly comes at the cost of sensitivity: when considering SNPs in strong LD with the lead SNP in all populations, 33% of the lead SNPs are annotated but have no SNP in LD with them.

We separately repeat our analysis by considering functional SNPs that are in strong LD in the CEU HapMap population with a lead SNP associated with a phenotype in a population of European descent. This increases the number of SNPs assessed for functional evidence. The fraction of lead SNPs that are mapped to a functional SNP also increases: 80% of the lead SNPs are found to be in strong LD with a SNP overlapping a region identified to be functional in at least one ENCODE assay. Recent analyses of the genetic structure of the European population do, however, show significant differences between different sub-populations, and it is therefore important to keep in mind that further analysis is needed, as the correlation between a lead SNP and a functional SNP in the HapMap CEU population might be weaker in the population in which the original association was performed.
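The two steps described in this section, building the LD set and picking the best-supported candidate, can be sketched as follows. The nested-dictionary layout of the r² data, the function names and the toy SNP identifiers are assumptions made for illustration. The score order follows Table 5.16, and including the lead SNP itself among the candidates is our reading of the text.

```python
POPULATIONS = ("CEU", "CHB", "JPT", "YRI")
SCORE_ORDER = ("2a", "2b", "2c", "3a", "3b", "4", "5a", "5b", "6", "7")


def ld_set(lead, r2, populations=POPULATIONS, threshold=0.8):
    """SNPs whose r2 with `lead` reaches `threshold` in every listed population.

    r2 is assumed to be laid out as r2[population][lead][other_snp] -> value.
    """
    partners = set(r2[populations[0]].get(lead, {}))
    return {
        snp for snp in partners
        if all(r2[pop].get(lead, {}).get(snp, 0.0) >= threshold for pop in populations)
    }


def evidence_rank(annotation):
    """Smaller is better: coding, then non-coding exon, then regulatory score."""
    transcribed, score = annotation
    if transcribed == "coding":
        return (0, 0)
    if transcribed == "non-coding exon":
        return (1, 0)
    return (2, SCORE_ORDER.index(score))


def best_candidate(lead, r2, annotations, **ld_options):
    candidates = {lead} | ld_set(lead, r2, **ld_options)
    # Sorting first makes ties resolve deterministically by SNP id.
    return min(sorted(candidates), key=lambda snp: evidence_rank(annotations[snp]))


# Toy data: rsB is in strong LD with the lead SNP in CEU but not in YRI.
r2 = {
    "CEU": {"rsLead": {"rsA": 0.96, "rsB": 1.00}},
    "CHB": {"rsLead": {"rsA": 0.90, "rsB": 0.85}},
    "JPT": {"rsLead": {"rsA": 1.00, "rsB": 1.00}},
    "YRI": {"rsLead": {"rsA": 1.00, "rsB": 0.40}},
}
annotations = {"rsLead": (None, "6"), "rsA": (None, "4"), "rsB": (None, "2a")}
best_all = best_candidate("rsLead", r2, annotations)
best_ceu = best_candidate("rsLead", r2, annotations, populations=("CEU",))
```

The toy example shows the trade-off discussed above: requiring strong LD in all four populations excludes rsB, so the more conservative analysis reports a less strongly supported candidate than the CEU-only analysis.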

5.6.3 Randomization

We use a conservative approach in order to estimate enrichment. We do so in two steps, by first filtering lead SNPs and then generating random null sets that closely match the properties of the lead SNPs. In the filtering step, we ensure that the lead SNPs that we use when computing enrichment are independent. If two lead SNPs were in strong linkage disequilibrium, then the sets of SNPs in LD with them would likely overlap, and a functional region in LD with both SNPs would be double counted. This is a fairly likely situation given that there are large regions of perfect linkage disequilibrium, and different chips use different tag SNPs to genotype the same region. If a phenotype is assessed on two different platforms, then two different SNPs could be reported as significant even though they are both in the same associated region. Accurately replicating the fine linkage disequilibrium structure between lead SNPs when building random null sets would be extremely challenging, and we therefore decide to consider only SNPs that are not in LD with each other in order to estimate enrichment. Second, as our method for identifying associated functional SNPs relies on linkage disequilibrium information from HapMap, we need to ensure that the fraction of SNPs that are assessed in HapMap does not differ substantially between the actual lead SNP set and the random sets. If we had allowed a lead SNP to be matched to any SNP in dbSNP, then LD information would only be available for a small subset of the random null sets, which would artificially decrease the number of functional SNPs identified in those random sets. In order to avoid any subtle difference that could bias our enrichment estimation, we limit the set of lead SNPs to SNPs that have been assessed in all four HapMap 2 populations, and similarly limit the set of random SNPs. While this substantially decreases the set of lead SNPs that are used to compute enrichment estimates, the fraction of lead SNPs overlapping functional regions in this smaller set (Tables 5.4 and 5.9) is comparable to using all reported associations (Tables 5.1 and 5.2).

We then compute a random set that matches the properties of the lead SNPs. A lead SNP is matched to a SNP with similar minor allele frequency, since minor allele frequency can affect the strength of the statistical association between a SNP and a phenotype. We then map each lead SNP to another SNP on the same genotyping array. This corrects for biases that may result from the choice of SNPs put on the genotyping array. We also map a lead SNP to a SNP that has the same function (with respect to UCSC genes), such that the fraction of coding SNPs, for example, is the same in the random sets as amongst lead SNPs. Finally, we also ensure that each lead SNP is mapped to a random SNP that is at a similar distance to the nearest transcription start site. This avoids the situation in which an association that is close to a gene, and thus more likely to overlap with some functional data, is matched by a SNP in a so-called gene desert in a null set.

We still obtain significant enrichment for functional regions even when taking all these factors into account, and when comparing the most stringent random set to all lead SNPs. While using such a conservative approach and still reaching significance is strong evidence that there is enrichment for functional elements in regions associated with disease and other phenotypes, one can argue that matching the properties of lead SNPs so closely actually amounts to overcorrecting. For example, while there appears to be a bias for associated SNPs to be closer to known genes than random SNPs on the same genotyping chip, this is actually a biologically interesting property of disease associations. By requiring the matched null sets to also lie closer to known genes, we increase the probability that those random SNPs are in regulatory regions, or in strong LD with a regulatory region. It is likely that the functional SNPs identified when using random null sets, and in particular those supported by multiple types of functional data, also affect some phenotype in some way, but that this association has yet to be discovered in a GWAS.

Filtering

In order to obtain null sets that are similar to the set of associated SNPs, we only consider SNPs that were assessed in all HapMap 2 populations, for which the minor allele frequency is known in dbSNP, and that we can map to a genotyping platform. We use the GWAS catalog to determine whether an associated SNP was found using an Affymetrix or an Illumina array, and then determine whether the SNP is in the corresponding list of SNPs we have previously computed. If the SNP is not in the list, or if the platform is not Affymetrix or Illumina, then the lead SNP is filtered out. An exception is the case where the SNP was found using imputation, in which case the SNP is kept in the set as long as it is in HapMap 2. A total of 1160 lead SNPs are filtered out at this stage. We then search for pairs of lead SNPs that show some evidence of linkage disequilibrium between them (r² ≥ 0.1 in any HapMap 2 or HapMap 3 population). For each such pair we keep only the lead SNP with the stronger association (more significant P-value reported in the GWAS catalog), and repeat this process until no pair of SNPs in linkage disequilibrium with each other is left in the filtered set. A total of 1200 lead SNPs are filtered out due to LD, leaving us with a set of 2364 SNPs. We repeat the annotation steps on this new set, and perform all enrichment comparisons using this set and its subset of European associations, which is obtained in the same way as for the set of all lead SNPs.
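The pairwise LD filter can be sketched as a single pass over the list of LD pairs. This is one possible reading of the procedure, under assumptions: the text does not specify the order in which pairs are visited or how ties in P-value are broken, and the result can depend on both. The function name and toy identifiers are hypothetical.

```python
def prune_ld_pairs(p_values, ld_pairs):
    """Drop the weaker SNP of every LD pair until no pair is left intact.

    p_values: dict mapping lead SNP id -> GWAS P-value
    ld_pairs: iterable of (snp_a, snp_b) pairs with some evidence of LD
    """
    kept = set(p_values)
    for snp_a, snp_b in ld_pairs:
        if snp_a in kept and snp_b in kept:
            # Larger P-value means weaker association; on a tie the
            # second SNP of the pair is dropped (arbitrary choice).
            weaker = snp_a if p_values[snp_a] > p_values[snp_b] else snp_b
            kept.discard(weaker)
    return kept


# Toy chain A-B-C in which A and C are not in LD with each other.
p_values = {"rsA": 1e-10, "rsB": 1e-8, "rsC": 1e-6}
kept_ab_first = prune_ld_pairs(p_values, [("rsA", "rsB"), ("rsB", "rsC")])
kept_bc_first = prune_ld_pairs(p_values, [("rsB", "rsC"), ("rsA", "rsB")])
```

Because a pair is only acted on while both members are still present, one pass guarantees that no LD pair remains. The toy chain illustrates the order dependence: visiting the A-B pair first keeps both rsA and rsC, whereas visiting B-C first keeps only rsA.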

Matched null sets

We compute several matched null sets that model the properties of the associated SNPs increasingly closely. We group SNPs into bins based on their minor allele frequency, with each bin representing a 5% minor allele frequency interval. For all null sets, a lead SNP is always matched to a random SNP in the same minor allele frequency bin. We also ensure that SNPs in each null set are not in linkage disequilibrium with each other, using the same criterion as for the lead SNPs. The null-dbSNP set is obtained by matching each lead SNP to a SNP in dbSNP 132. The null-HapMap2 set is obtained in a similar way, except that the matched SNPs must be amongst the SNPs assessed in all HapMap 2 populations. The null-Array set is obtained by matching a lead SNP with a SNP that appears on the same platform (Illumina or Affymetrix) as the lead SNP, or a SNP that is in HapMap 2 if the SNP has been imputed. If the original study includes both Affymetrix and Illumina platforms, and the lead SNP is present on both, then the SNP is matched to a random SNP from either platform. The null-Array-Function set uses the same criteria as the null-Array set, but in addition also requires that the matched SNP is in the same functional category (as predicted using UCSC genes, see above) as the lead SNP. Finally, the null-Array-Function-Distance set is obtained by also requiring that the matched SNP is at a similar distance to the nearest transcription start site (with respect to UCSC genes) as the lead SNP if the lead SNP is located in an intergenic region or an intron. We group SNPs into bins according to the logarithm of their distance to the nearest transcription start site, and each bin represents a log10(distance) interval of 0.1. All SNPs with a log10(distance) under 3.5 are grouped into one bin, and all SNPs with a log10(distance) above 13 are grouped into one bin. While we use all these random null sets in Figure 5.4B and C, all the remaining enrichments presented in this work are only computed using the most stringent null-Array-Function-Distance sets.
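The binning and matching behind the most stringent null set can be sketched as below. This is a simplified illustration under assumptions: SNPs are represented as dictionaries with hypothetical field names and category labels, the bin bounds are the values printed in the text, the requirement that null SNPs are not in LD with each other is omitted, and raising an error when no match exists is our choice since the text does not describe that case.

```python
import math
import random
from collections import defaultdict


def maf_bin(maf, width=0.05):
    """Index of the 5% minor allele frequency interval."""
    return int(maf / width + 1e-9)


def tss_distance_bin(distance_bp, width=0.1, low=3.5, high=13.0):
    """Index of the log10(distance) interval, with one bin below `low`
    and one bin above `high` (bounds as printed in the text)."""
    d = math.log10(max(distance_bp, 1))
    if d < low:
        return -1
    top = int(round((high - low) / width))
    if d > high:
        return top
    return min(int((d - low) / width + 1e-9), top - 1)


def match_key(snp):
    key = (maf_bin(snp["maf"]), snp["platform"], snp["function"])
    if snp["function"] in ("intergenic", "intron"):
        key += (tss_distance_bin(snp["tss_distance"]),)
    return key


def draw_null_set(leads, pool, rng):
    """Match every lead SNP to a distinct random SNP with the same key."""
    by_key = defaultdict(list)
    for snp in pool:
        by_key[match_key(snp)].append(snp)
    used, null_set = set(), []
    for lead in leads:
        options = [s for s in by_key[match_key(lead)]
                   if s["id"] not in used and s["id"] != lead["id"]]
        if not options:
            raise ValueError(f"no matched SNP available for {lead['id']}")
        choice = rng.choice(options)
        used.add(choice["id"])
        null_set.append(choice)
    return null_set


def snp(snp_id, maf, distance):
    return {"id": snp_id, "maf": maf, "platform": "Illumina",
            "function": "intergenic", "tss_distance": distance}


leads = [snp("rsLead", 0.12, 20000)]
pool = [snp("rsP1", 0.14, 25000), snp("rsP2", 0.14, 500000), snp("rsP3", 0.31, 25000)]
null_ids = [s["id"] for s in draw_null_set(leads, pool, random.Random(0))]
```

In the toy pool only rsP1 shares the allele frequency bin and the distance bin of the lead SNP, so it is the only admissible match. Calling `draw_null_set` repeatedly with different random states would yield the repeated null sets used for the empirical distribution.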

Statistical analysis

We create n = 100 matched null sets and then repeat the annotation steps on each null set, and obtain an empirical distribution of the fraction of functional SNPs expected for matched SNPs, and of the score distribution amongst matched SNPs. We obtain a P-value for the difference between the lead SNPs and the null sets using a Student's t distribution with n-1 degrees of freedom and the same mean and standard deviation as the empirical distribution of the counts overlapping the feature in the n randomized null sets. This distribution is used to estimate the probability of having a null set (which is by construction of the same size as the set of lead SNPs) with a fraction of SNPs overlapping the feature that is as extreme or more extreme than the fraction observed for the lead SNP set, which results in a two-tailed P-value.
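The P-value computation can be sketched with the standard library alone. The sketch assumes that the observed count is standardized with the mean and the sample standard deviation of the null counts, and that the tail probability is read from a Student's t distribution with n-1 degrees of freedom. The numerical integration and the function names are ours, and the result loses precision for very small P-values.

```python
import math
from statistics import mean, stdev


def t_two_tailed(t_stat, df, steps=20000):
    """Two-tailed P-value of Student's t, by Simpson integration of the density."""
    t_abs = abs(t_stat)
    if t_abs == 0.0:
        return 1.0
    log_norm = (math.lgamma((df + 1) / 2) - math.lgamma(df / 2)
                - 0.5 * math.log(df * math.pi))

    def pdf(x):
        return math.exp(log_norm - (df + 1) / 2 * math.log1p(x * x / df))

    h = t_abs / steps  # steps must be even for Simpson's rule
    total = pdf(0.0) + pdf(t_abs)
    for i in range(1, steps):
        total += (4 if i % 2 else 2) * pdf(i * h)
    central_mass = 2 * total * h / 3  # probability of -|t| <= T <= |t|
    return max(0.0, 1.0 - central_mass)


def enrichment_p_value(observed, null_values):
    """Compare the lead-SNP count with counts from the matched null sets."""
    df = len(null_values) - 1
    t_stat = (observed - mean(null_values)) / stdev(null_values)
    return t_two_tailed(t_stat, df)


p_toy = enrichment_p_value(5, [1, 2, 3])  # toy counts, t = 3 with 2 degrees of freedom
```

With n = 100 null sets the distribution has 99 degrees of freedom, for which a standardized difference of about 1.98 corresponds to a two-tailed P-value of 0.05.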

5.6.4 Analysis at the phenotype level

We group all lead SNPs per phenotype using the GWAS catalog phenotype classification. We do not further group phenotypes, even though some are similar. We use only associations identified or replicated in populations of European descent. For each lead SNP, we count how many times the lead SNP or at least one SNP in strong LD (r² ≥ 0.8 in the HapMap 2 CEU population) overlaps with a ChIP-seq peak for a given DNA-binding protein. Each lead SNP is counted at most once for each DNA-binding protein, and we ensure that no two lead SNPs are in LD with each other. We then add the totals for all the lead SNPs associated with each phenotype. We use Fisher's exact test on a 2x2 table to show that the fraction of lead SNPs associated with height that are in strong LD with at least one SNP overlapping with a CTCF ChIP-seq peak is higher than the same fraction for all associated lead SNPs.

Chapter 6

Analysis of Functional SNPs

In this chapter we describe in more detail several functional SNPs identified using the methods discussed in Chapter 5. In particular, we present examples that show how ENCODE data can be used in order to generate novel and interesting biological hypotheses.

6.1 Strongly supported functional SNPs

We identified functional SNPs in strong linkage disequilibrium for a large fraction of all reported associations. A table mapping each association to a list of candidate functional SNPs is available on the RegulomeDB website (http://RegulomeDB.org/GWAS). Table 6.1 highlights the lead SNPs supported by the strongest functional evidence. These overlap a ChIP-seq peak, a DNase peak, a DNase footprint and a predicted motif, and the transcription factor binding detected using ChIP-seq matches the conserved motif used in DNase footprinting. Table 6.2 provides a similar list for functional SNPs supported by the same amount of regulatory evidence, but which are in strong LD with a lead SNP in all HapMap 2 populations. The lead SNP itself is supported by less or no evidence of a functional role. Table 6.3 provides a list of functional SNPs supported by the same amount of regulatory evidence, but which are in strong LD with a lead SNP in the HapMap 2 CEU population, but not all HapMap 2 populations. Only associations identified in populations of European descent are used for this table. Functional SNPs in strong LD with the lead SNP are located as far as 170 kilobase pairs from the reported association. Each of the functional SNPs we identify is a biological hypothesis supported by experimental regulatory data, but which still requires further validation.

Chr    Lead SNP    Phenotype                               P-value
chr1   rs1967017   Serum urate [158]                       4 · 10^-8
chr5   rs2188962   Crohn's disease [159]                   1 · 10^-7
                   Crohn's disease [160]                   2 · 10^-18
chr6   rs9491696   Waist-hip ratio [161]                   2 · 10^-32
chr6   rs9483788   Hematocrit [162]                        3 · 10^-15
                   Other erythrocyte phenotypes [162]      1 · 10^-47
chr11  rs2074238   QT interval [163]                       3 · 10^-17
chr11  rs7940646   Platelet aggregation [164]              1 · 10^-6
chr12  rs902774    Prostate cancer [133]                   5 · 10^-9
chr14  rs1256531   Conduct disorder (symptom count) [165]  4 · 10^-6
chr15  rs17293632  Crohn's disease [166]                   3 · 10^-19
chr16  rs4788084   Type 1 diabetes [167]                   3 · 10^-13
chr17  rs9303029   Protein quantitative trait loci [168]   4 · 10^-7
chr19  rs10411210  Colorectal cancer [169]                 5 · 10^-9
chr19  rs3764650   Alzheimer's disease [170]               5 · 10^-17

Table 6.1: Overview of the lead SNPs most strongly supported by functional evidence. Each of these lead SNPs overlaps a ChIP-seq peak, matched DNase footprint, matched motif and a DNaseI-seq peak (RegulomeDB score of 2a). Rep indicates that the study includes a replication cohort. SNPs in bold are also eQTLs.

6.2 Replication of a previously validated functional SNP

We show that we can re-identify a previously validated functional SNP. Lead SNP rs1541160 is associated with Amyotrophic Lateral Sclerosis (ALS) in a GWAS, and there is no evidence that this SNP overlaps a functional region. However, it is in perfect LD with rs522444, a functional SNP overlapping DNase hypersensitivity regions and ChIP-seq peaks in a large number of ENCODE cell lines. The authors of the original study identified rs522444 due to its position in a putative SP1 binding site and experimentally validated its functional role [103] in altering the expression of gene KIFAP3.

                                                                                                                          LD (r²)
Lead SNP           Score  Phenotype                               P-value      Best SNP in LD  Distance to lead SNP (bp)  CEU   CHB   JPT   YRI
chr1   rs6686842   6      Height [125]                            2 · 10^-8    rs11209342      61,205                     0.96  0.90  1.00  1.00
chr1   rs380390    6      Age-related macular degeneration [171]  4 · 10^-8    rs381974        8,379                      1.00  0.85  1.00  1.00
chr3   rs6806528   4      Celiac disease [172]                    2 · 10^-7    rs6784841       733                        1.00  1.00  1.00  1.00
chr4   rs1800789   7      Fibrinogen [173]                        2 · 10^-30   rs4333166       1,232                      1.00  1.00  1.00  1.00
chr5   rs3776331   7      Serum uric acid [174]                   8 · 10^-6    rs3893579       4,547                      0.96  1.00  0.95  1.00
chr6   rs7743761   5b     Ankylosing spondylitis [175]            1 · 10^-303  rs6457401       1,248                      0.93  0.95  1.00  1.00
chr6   rs642858    6      Type 2 diabetes [176]                   2 · 10^-6    rs1361248       21,711                     0.94  1.00  1.00  0.92
chr7   rs12700667  4      Endometriosis [177]                     1 · 10^-9    rs1451385       6,918                      0.85  0.92  1.00  1.00
chr9   rs3890182   5b     HDL cholesterol [178]                   5 · 10^-7    rs3847302       940                        1.00  1.00  1.00  0.94
                          HDL cholesterol [179]                   3 · 10^-10
chr9   rs2383207   7      Abdominal aortic aneurysm [180]         2 · 10^-8    rs1333047       8,545                      0.89  0.95  1.00  1.00
chr16  rs7197475   6      Systemic lupus erythematosus [181]      3 · 10^-8    rs7194347       2,777                      1.00  1.00  1.00  0.83
chr16  rs7186852   5a     Systemic lupus erythematosus [181]      3 · 10^-7    rs7194347       9,985                      0.96  1.00  1.00  0.83
chr19  rs12986413  4      Height [182]                            3 · 10^-8    rs1015670       536                        1.00  1.00  1.00  0.96

Table 6.2: Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in all populations. Each best functional SNP in this table overlaps a ChIP-seq peak, matched DNase footprint, matched motif and a DNaseI-seq peak (RegulomeDB score of 2a), and is in strong LD (r² ≥ 0.8) with the lead SNP in all four HapMap 2 populations. SNPs in bold are also eQTLs.

6.3 A new functional SNP for type 2 diabetes

6.3.1 Results

We identify rs7163757 as a novel putative functional SNP associated with type 2 diabetes (Figure 6.1). This SNP is in strong LD with rs7172432, a SNP recently shown to be associated with type 2 diabetes in the Japanese population and replicated in a European population [206], and associated with insulin response in the Danish population [207]. This functional SNP is supported by evidence from both DNaseI hypersensitivity and ChIP-seq assays. DNase footprinting indicates that the functional SNP overlaps a potential NFAT binding site. Interestingly, the risk allele at rs7172432 is the common allele in the population (53%), and there is a single haplotype with frequency above 1% that includes the risk allele between the associated SNP and the functional SNP, but several haplotypes with high frequency that include the protective allele.

                                                                                                                                  LD (r²)
Lead SNP           Score  Phenotype                                       P-value      Best SNP in LD  Distance to lead SNP (bp)  CEU   CHB   JPT   YRI
chr1   rs2816316   7      Celiac disease [172]                            2 · 10^-17   rs2984920       7,982                      1.00  1.00  1.00  0.05
                          Celiac disease [183]                            3 · 10^-11   rs1323296       695                        1.00  1.00  1.00  0.52
chr1   rs4949526   5a     Bipolar disorder and schizophrenia [184]        4 · 10^-7    rs4949524       9,517                      0.84  0.69  0.59  0.00
chr4   rs4234798   2b     Insulin-like growth factors [185]               5 · 10^-10   rs4234797       26                         1.00  0.79  1.00  1.00
chr6   rs1361108   5b     Menarche (age at onset) [186]                   2 · 10^-8    rs9388486       106,446                    0.92  1.00  0.00  0.51
chr6   rs9494145   7      Red blood cell traits [187]                     3 · 10^-15   rs9483788       2,949                      0.87  0.73  0.89  1.00
chr7   rs1055144   5b     Waist-hip ratio [161]                           1 · 10^-24   rs1451385       23,612                     0.83  0.17  0.11  0.02
chr8   rs2019960   7      Hodgkin’s lymphoma [188]                        1 · 10^-13   rs7826019       5,723                      1.00  0.00  0.63  0.96
chr9   rs7873102   7      structure [189]                                 6 · 10^-7    rs776010        111,304                    0.96  0.79  0.75  0.00
chr9   rs1333049   7      Coronary heart disease [190]                    7 · 10^-58   rs1333047       999                        1.00  0.51  0.40  0.00
                          Coronary heart disease [191]                    3 · 10^-19
                          Coronary heart disease [10]                     1 · 10^-13
chr9   rs4977574   2c     Coronary heart disease [192]                    1 · 10^-22   rs1333047       25,930                     0.89  0.47  0.36  0.00
                          Coronary heart disease [193]                    2 · 10^-25
                          Myocardial infarction (early onset) [194]       3 · 10^-44
chr9   rs3905000   7      MRI atrophy measures [195]                      9 · 10^-6    rs3847302       8,475                      1.00  1.00  1.00  0.67
                          HDL cholesterol [196]                           9 · 10^-13
chr10  rs1561570   4      Paget’s disease [197]                           4 · 10^-38   rs10752286      4,377                      0.96  1.00  1.00  0.70
                          Paget’s disease [198]                           6 · 10^-13
chr10  rs563507    5b     Acute lymphoblastic leukemia (childhood) [199]  9 · 10^-6    rs773983        38,857                     1.00  0.00  0.00  0.13
chr11  rs7127900   5b     Prostate cancer [130]                           3 · 10^-33   rs7123299       1,230                      1.00  1.00  0.89  0.65
chr11  rs561655    5b     Alzheimer’s disease (late onset) [200]          7 · 10^-11   rs1237999       14,751                     0.84  0.83  0.62  1.00
chr11  rs10898392  7      Height [201]                                    3 · 10^-6    rs575050        174,791                    0.81  0.95  0.42  0.73
chr12  rs2638953   7      Height [202]                                    7 · 10^-17   rs10506037      116,731                    0.84  1.00  1.00  0.00
chr14  rs7142002   7      Autism [203]                                    3 · 10^-6    rs3993395       38,489                     0.87  0.26  0.20  0.62
chr15  rs261334    5b     HDL cholesterol [178]                           5 · 10^-22   rs8034802       1,952                      0.83  0.27  0.48  0.57
chr17  rs12946454  5b     Systolic blood pressure [204]                   1 · 10^-8    rs4792867       41,323                     0.83  0.43  0.27  0.51
                                                                                       rs11657325      34,573                     1.00  0.93  0.81  0.51
chr17  rs6504218   7      Coronary heart disease [193]                    1 · 10^-6    rs9902260       8,427                      0.92  1.00  1.00  0.56
chr22  rs738322    4      Cutaneous nevi [205]                            1 · 10^-6    rs2016755       29,402                     0.89  0.71  0.44  0.71

Table 6.3: Strongly supported functional SNPs in linkage disequilibrium with an associated lead SNP in the European population. Each functional SNP in this table overlaps a ChIP-seq peak, matched DNase footprint, matched motif and a DNaseI-seq peak (RegulomeDB score of 2a), and is in strong LD (r² ≥ 0.8) with the lead SNP in the HapMap 2 CEU population. Associations were identified and replicated in a population of European descent. SNPs in bold are also eQTLs. SNPs in italic are eQTLs in LD with the lead SNP, but do not have a RegulomeDB score of 2a.

[Figure 6.1: image not reproduced. Panels: (A) UCSC Genome Browser view of hg19 chr15:62,370,000-62,460,000 showing UCSC Genes (C2CD4A, C2CD4B), ENCODE digital DNaseI hypersensitivity clusters, ENCODE transcription factor ChIP-seq and lead SNP rs7172432; (B) ENCODE transcription factor ChIP-seq and open chromatin (DNaseI HS, ENCODE/OpenChrom, Duke University) tracks around chr15:62,391,000-62,392,000; (C) NFAT motif logo (information content in bits, WebLogo 3.2) at functional SNP rs7163757; (D) linkage disequilibrium plot for the SNPs between rs7163757 and rs7172432; (E) haplotypes and their frequencies.]

Figure 6.1: Functional SNP rs7163757

Multiple sources of evidence indicate that SNP rs7163757 is functional. (A.) Overview of the region between genes C2CD4A and C2CD4B. Functional SNP rs7163757 is indicated using a blue vertical line, lead SNP rs7172432 using a green vertical line. Multiple ChIP-seq and DNase-seq peaks can be seen, including one that overlaps rs7163757. (B.) Vicinity of functional SNP rs7163757. ChIP-seq binding is observed for multiple transcription factors in multiple cell lines. Due to space constraints, DNase peaks are represented only for a subset of the peaks overlapping the region. (C.) Sequence around rs7163757 and motif for the NFAT binding site that overlaps the functional SNP. The minor allele is T. (D.) Linkage disequilibrium region between the functional SNP and the lead SNP in the HapMap 2 CEU population. The two SNPs are in perfect LD (r² = 1.0). (E.) Haplotypes between the functional SNP and the lead SNP. There is a single haplotype with frequency above 1% that carries the identified risk allele (A at rs7172432), whereas there are multiple haplotypes that include the protective allele. Haplotypes with frequency of less than 1% are not shown.

6.3.2 Discussion

Although the association between lead SNP rs7172432 and type 2 diabetes and insulin resistance was replicated in multiple populations of both European and Asian origin, the mechanism underlying the association is unclear. The region is located between genes C2CD4A and C2CD4B (C2 calcium-dependent domain containing 4A/B), whose biological role is unknown and which had not been previously implicated in diabetes. This region contains additional SNPs associated with type 2 diabetes: rs1439955 in Chinese [208], and rs11071657 [209] and rs17271305 [210] in European populations. These SNPs are in linkage disequilibrium with rs7163757 in the respective populations (r² of 0.523 in CHB, 0.521 in CEU and 0.283 in CEU, respectively). There is therefore strong evidence linking this region to diabetes.

DNaseI footprinting identifies a Nuclear factor of activated T-cells (NFAT) footprint that overlaps rs7163757. NFAT is part of the Calcineurin/NFAT pathway [138], which has been implicated in the regulation of growth and function of the insulin-producing pancreatic beta cells, and linked to the expression of genes known to be associated with type 2 diabetes [139]. Glucose and glucagon-like peptide-1 (GLP-1) together lead to the expression of NFAT, which regulates the transcription of the insulin gene [211]. The binding site affected by fSNP rs7163757 might thus be involved in linking glucose level to the expression of genes in this region.

It is interesting to observe that the haplotype which includes the risk allele identified at rs7172432 is the most frequent haplotype in the CEU population (50.8%), and that there is markedly less diversity between the lead SNP and the fSNP for the risk allele (a single haplotype) than for the protective allele (six different haplotypes, three of which have a frequency above 5%). Long haplotypes are a sign of positive selection [212], and positive selection for a type 2 diabetes risk allele would be compatible with the thrifty gene hypothesis [213], under which alleles that currently cause diabetes were advantageous in the past history of the human species, as they increased fat storage and thus survival in times of starvation. In this context, the link between this region and glucose and GLP-1 through NFAT signaling would provide a putative mechanistic hypothesis for the association. Further work is, however, needed, in particular to determine which genes have differential expression correlated with SNPs in this region, and how they affect pathways related to diabetes.

This example does highlight the value of DNase footprinting, which identified a motif that is very relevant to the associated phenotype, even though NFAT was not assessed in the current ENCODE ChIP-seq experiments. ChIP-seq experiments identified additional proteins bound in this region that might interact with NFAT in order to regulate the expression of genes in this region.

6.4 The 9p21 region in coronary artery disease

The 9p21 region of chromosome 9 contains several SNPs strongly associated with coronary artery disease in multiple studies across different populations [214]. Risk alleles in 9p21 lead to a 20-30% per allele increase in disease risk [215]. The region containing the SNPs associated with coronary artery disease is adjacent to a region containing SNPs associated with type 2 diabetes [216]. The 9p21 region is a gene desert: the closest known gene is located more than one hundred kilobases away. Furthermore, strong linkage disequilibrium has been observed across the 9p21 region. Figure 6.2 provides an overview of the features of the 9p21 region. This region is therefore a perfect use case for approaches aimed at detecting variants that play a biological role in disease. High throughput functional information has also been used to identify a large number of enhancers in the 9p21 region, and to determine that two SNPs associated with coronary artery disease overlap with an enhancer and disrupt a STAT1 binding site [114].

In this section we discuss several specific functional SNPs, and in particular provide evidence indicating that rs1333047, a SNP in perfect linkage disequilibrium with coronary artery disease associated SNP rs1333049 in the CEU population only, is likely a functional SNP. This result can explain why the association between rs1333049 and coronary artery disease has not been replicated in populations of African descent.

[Figure 6.2: image not reproduced. The figure shows chromosome 9 and the 9p21 region, marking SNPs rs1333049, rs10757278, rs1333045 and rs4977574.]

Figure 6.2: Overview of the 9p21 region. SNPs significantly associated with coronary artery disease are indicated. The SNPs shown on Figure 6.3 are highlighted in green. These SNPs are located over 100 kilobases away from the nearest genes CDKN2B and CDKN2A. Exons of a non-coding RNA, CDKN2BAS (also called ANRIL), stretch across the whole region. Strong linkage disequilibrium can be observed across the entire region.

6.4.1 Results

The 9p21 gene desert is a region that contains multiple SNPs that are strongly associated with coronary artery disease. We consider the functional information available from ENCODE in order to generate candidate functional SNPs in this region. The association between rs1333049 and coronary artery disease has been replicated in multiple studies in populations of European descent [10, 191, 217, 190] as well as in populations of Japanese and Korean descent [218, 219]. In the HapMap 2 CEU population, this SNP is part of a haplotype block that includes rs10757278 and rs1333047, both of which are in perfect LD with rs1333049. rs10757278 has itself also been associated with coronary artery disease in multiple populations of European descent [214, 220, 217] and in the Chinese Han population [221]. Figure 6.3 provides an overview of this region. There is no evidence supporting a functional role for rs1333049. However, both rs10757278 and rs1333047 overlap a DNase hypersensitivity peak as well as ChIP-seq peaks for STAT1 and STAT3 in HeLa-S3 cells, and are therefore functional SNPs. Furthermore, rs10757278 lies in a STAT1 binding site, and rs1333047 lies in a binding site and a DNaseI footprint for Interferon-stimulated gene factor 3 (ISGF3). The motif is a good match when extending the less specific part of the motif (positions 8-9) located between the two very specific regions (positions 2-7 and 10-13) by a base pair. This is similar to cases of variable spacer length previously observed for transcription factor binding motifs [222]. While the functional role of rs10757278 has been previously reported [114], evidence of the functional role of rs1333047 is novel. Interestingly, while only 27 base pairs separate the two SNPs, they are in perfect linkage disequilibrium in the CEU population only. The frequency of the A allele at rs1333047 in the Yoruba in Ibadan, Nigeria (YRI) HapMap 2 population is only 0.8%, compared to 50.8% in the CEU population. This allele is part of the protective haplotype found in GWAS performed in populations of European descent. The A allele is part of the motif for ISGF3 binding, whereas the T allele is not.

The most recent meta-analysis in populations of European descent [192] identifies rs4977574 as the most strongly associated locus in 9p21. We find that this SNP overlaps DNase hypersensitivity peaks in two ENCODE cell lines (Hah and Lncap), a conserved motif for the Androgen Receptor (AR), and a ChIP-seq peak for AR in a data set from Wei et al. [124]. The RegulomeDB score for this SNP is 2c. In the YRI population, the minor allele frequency of this SNP is only 7.5%, and it is in strong LD with rs10757278 (r² = 0.803) and in weaker LD with rs1333049 (r² = 0.382). It is in strong LD with both in the CEU population (r² of 0.874 and 0.885, respectively). Previously identified functional candidate rs1333045 [104] obtains a RegulomeDB score of 3a, as it overlaps a conserved motif (Hand1::Tcfe2a), a ChIP-seq peak for GATA3 in the T-47D ENCODE cell line, and a DNase peak in the Huvec ENCODE cell line. This SNP has a minor allele frequency of 45.8% in the YRI population, and is in weak LD with both rs10757278 (r² = 0.119) and rs1333049 (r² = 0.251). In the CEU population it is in strong LD with both (r² of 0.808 and 0.815, respectively).

[Figure 6.3: image not reproduced. Recoverable panel content: ENCODE ChIP-seq tracks for STAT1 and STAT3 in HeLa-S3 cells (hg19, chr9:22,124,000-22,125,500); sequence logos for the STAT1 and ISGF3 motifs (WebLogo 3.2); SNP positions rs10757278 (chr9:22,124,477), rs1333047 (chr9:22,124,504) and rs1333049 (chr9:22,125,503); and the haplotype and LD summary below.]

Population  Haplotypes (frequency)                               rs1333049 and rs1333047  rs1333049 and rs10757278
CEU         AAG 50.8%, GTC 49.2%                                 D' = 1.0, r² = 1.0       D' = 1.0, r² = 1.0
CHB + JPT   GTC 48.9%, AAG 31.1%, ATG 19.4%, ATC 0.6%            D' = 1.0, r² = 0.442     D' = 1.0, r² = 0.978
YRI         ATG 80.0%, ATC 10.0%, GTC 7.5%, GTG 1.7%, AAG 0.8%   D' = 1.0, r² = 0.002     D' = 0.78, r² = 0.289

Figure 6.3: Evidence supporting the implication of rs1333047 in coronary artery disease. Functional data (ChIP-seq) generated by the ENCODE consortium shows evidence of STAT1 binding in the 9p21 region associated with coronary artery disease. rs10757278 and rs1333047 are both located in the peak, whereas rs1333049 is a tag SNP that does not overlap any functional region in RegulomeDB. rs10757278 is part of a regulatory motif for STAT1 binding, and rs1333047 is part of a regulatory motif for ISGF3 binding. The star symbol denotes the location at which a gap is inserted into the motif to handle variable linker length. Haplotype frequency and linkage disequilibrium data from the different HapMap 2 populations show that all three SNPs are in perfect linkage disequilibrium in the CEU population, but not in the CHB and JPT populations. In the YRI population, the frequency of the A allele at rs1333047 is only 0.8%. Risk alleles for all SNPs are determined using the haplotype associated with coronary artery disease in the CEU population, and represented in red. There is an absence of linkage disequilibrium between rs1333047 and rs1333049 in YRI, and the association between rs1333049 and rs10757278 and coronary artery disease has not been replicated in populations of African descent.
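The linkage disequilibrium values shown in Figure 6.3 follow directly from the haplotype frequencies in the same figure. The Python sketch below recomputes D' and r² from those frequencies. It rests on two assumptions that are not stated in the figure: that the three positions of each haplotype are ordered by genomic coordinate (rs10757278, rs1333047, rs1333049), and that CHB and JPT are pooled in a single panel. The function name `pairwise_ld` is illustrative, and Haploview itself estimates these quantities from genotype data rather than from rounded frequencies, so small differences in the last digit are expected.

```python
# Haplotype frequencies (in percent) as read from Figure 6.3.
# Assumed position order within each haplotype string:
#   index 0 = rs10757278, index 1 = rs1333047, index 2 = rs1333049
HAPLOTYPES = {
    "CEU": {"AAG": 50.8, "GTC": 49.2},
    "CHB+JPT": {"GTC": 48.9, "AAG": 31.1, "ATG": 19.4, "ATC": 0.6},
    "YRI": {"ATG": 80.0, "ATC": 10.0, "GTC": 7.5, "GTG": 1.7, "AAG": 0.8},
}


def pairwise_ld(haplotypes, i, j):
    """Return (D', r2) between biallelic positions i and j.

    Both positions are assumed to be polymorphic; otherwise the
    denominators below would be zero.
    """
    total = sum(haplotypes.values())
    # Use the alleles of the first listed haplotype as reference alleles.
    reference = next(iter(haplotypes))
    a, b = reference[i], reference[j]
    p_a = sum(f for h, f in haplotypes.items() if h[i] == a) / total
    p_b = sum(f for h, f in haplotypes.items() if h[j] == b) / total
    p_ab = sum(f for h, f in haplotypes.items() if h[i] == a and h[j] == b) / total

    d = p_ab - p_a * p_b  # raw disequilibrium coefficient D
    r2 = d * d / (p_a * (1 - p_a) * p_b * (1 - p_b))
    if d >= 0:
        d_max = min(p_a * (1 - p_b), (1 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1 - p_a) * (1 - p_b))
    return abs(d) / d_max, r2


for population, haps in HAPLOTYPES.items():
    for label, (i, j) in {"rs1333047-rs1333049": (1, 2),
                          "rs10757278-rs1333049": (0, 2)}.items():
        d_prime, r2 = pairwise_ld(haps, i, j)
        print(f"{population:8s} {label:22s} D'={d_prime:.2f} r2={r2:.3f}")
```

Under these assumptions the recomputed values agree with the figure to within rounding, including the near-zero r² between rs1333047 and rs1333049 in YRI, which is driven by the 0.8% frequency of the A allele at rs1333047.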

6.4.2 Discussion

A new functional SNP in 9p21 could explain the lack of association in populations of African descent

We identify rs1333047 as a candidate functional SNP in the 9p21 region. This region is associated with coronary artery disease [223] and several other diseases, and the risk of coronary artery disease for the 25% of individuals in populations of European descent that are homozygous for the risk allele is two times higher than for individuals homozygous for protective alleles [215]. This region is a gene desert, but the non-coding RNA ANRIL overlaps the SNPs associated with coronary artery disease [217]. A recent study in mice showed that the deletion of the region orthologous to 9p21 leads to changes in the expression of the orthologs of the two human genes closest to 9p21, cyclin-dependent kinase inhibitor genes CDKN2A and CDKN2B, and has effects on the phenotype [224]. SNPs associated with coronary artery disease affect the expression of ANRIL, and to a smaller extent CDKN2A and CDKN2B, in humans [225].

The candidate functional SNP rs1333047 potentially disrupts a binding site for Interferon-stimulated gene factor 3 (ISGF3). ISGF3 is part of the JAK-STAT (Janus Activated Kinase - Signal Transducer and Activator of Transcription) cascade. In Type-I-Interferon signaling, STAT1, STAT2 and IFN-regulatory factor 9 (IRF9) form the ISGF3 complex [226] that binds to IFN-stimulated response elements (ISRE) in the nucleus. This contrasts with Type-II-Interferon signaling, in which STAT1-STAT1 homodimers directly bind to IFN-γ-activated sites (GAS). Interferon-γ is the only Type-II-Interferon. A review of both Interferon signaling pathways can be found in Platanias 2005.

We find two functional SNPs in perfect linkage disequilibrium with each other and with tag SNP rs1333049 in the HapMap 2 CEU population. The second functional SNP, rs10757278, has been previously shown to be functional [114], and is located in a GAS. The experimental evidence supporting the functional role of rs10757278 does, however, also support rs1333047. Both SNPs are in perfect linkage disequilibrium in all individuals re-sequenced by Harismendy et al., and would thus be equally strongly associated with the phenotype. Harismendy et al. compare lymphoblastoid cell lines (which have a high expression level of STAT1) that are homozygous for the risk allele to lymphoblastoid cell lines that are homozygous for the protective allele at rs10757278. Given the perfect LD in this region, they are likely also homozygous for the risk and protective alleles, respectively, at rs1333047. They show that in cell lines homozygous for the protective allele, STAT1 knockdown leads to a 7-fold up-regulation of the expression of ANRIL, a non-coding transcript located in the 9p21 region, whereas there is a much smaller effect in cell lines homozygous for the risk allele. This would, however, also be consistent with an effect caused by rs1333047, since STAT1 is part of the ISGF3 complex, and a knockdown of STAT1 would therefore also affect ISGF3. Furthermore, they use ChIP to identify that STAT1 binds at rs10757278 only in cell lines with the protective allele.
Binding of STAT1 in this region does not imply the absence of ISGF3 binding, and given that STAT1 is part of the ISGF3 complex, it is also possible that ISGF3 binding was detected rather than binding of the STAT1-STAT1 homodimer. Finally, Harismendy et al. show that treatment with Interferon-γ leads to a change in expression of ANRIL in HeLa and HUVEC cell lines. As Interferon-γ is only known to be involved in the Type-II-Interferon pathway, this cannot be explained by ISGF3 binding to the ISRE at rs1333047. This change is, however, in the opposite direction to the change expected from the observation in the STAT1 knockdown experiment, which indicates that there might be multiple binding sites at play in this region. The evidence is therefore compatible with a functional role for both rs1333047 and rs10757278.

While both functional SNPs are in perfect linkage disequilibrium in the HapMap 2 CEU population, as well as in the individuals studied by Harismendy et al., this is not the case in other populations. In particular, the protective allele at rs1333047 is rare (0.8%) in the HapMap 2 YRI population. Major differences in allele frequency and linkage disequilibrium structure between populations in the 9p21 region have been previously identified [227]. Interestingly, multiple GWAS of coronary artery disease in populations of African descent failed to replicate the association at rs1333049 [228] and rs10757278 [229, 182]. While these studies did not replicate the most strongly associated SNPs in European populations, they did identify SNPs that are associated with coronary artery disease in populations of African descent. Two additional associated SNPs, rs10757274 and rs2383206, were identified in European [223] and South Korean [220] populations, but were not replicated in African Americans [223]. rs10757274 has been associated with heart failure in individuals of European descent, but the association was not significant in African Americans [230]. Therefore there is strong evidence that associations with coronary artery disease identified in populations of European origin in 9p21 are not replicated in populations of African origin. If rs10757278 is the functional SNP that has the largest effect on the phenotype in this region, then the absence of replication can only be explained by an interaction between the effect at rs10757278 and some other region in which the populations of African descent differ from the populations of European descent. If, however, rs1333047 functionally affects the phenotype, then the lack of replication can be explained by the lack of linkage disequilibrium between rs1333047 and the genotyped SNPs.
Therefore, the lack of replication of this finding in populations of African descent supports rs1333047 as a candidate functional SNP in this region. Furthermore, rs1333047 is the SNP most strongly associated with coronary artery disease in an African American population when using a method that combines association and admixture information [231].

The linkage disequilibrium patterns also differ between the HapMap 2 CEU population and the two Asian populations (CHB and JPT). Linkage disequilibrium does, however, remain relatively high (r² of respectively 0.978 and 0.442 between rs1333047 and associated SNPs rs10757278 and rs1333049). A large scale, gene centric analysis [232] showed that the effect size for the association between 9p21 SNP rs1333042 and coronary artery disease was larger in European than in Asian populations (odds ratio of 1.27 and 1.14, respectively). We analyze the linkage disequilibrium between rs1333047 and haplotypes identified in previous studies in the Han Chinese population [221], which include rs2383206, rs1004638, rs17761446 and rs10757278. Only haplotype AATA includes the protective A allele at rs1333047, and rs1333047 is in strong linkage disequilibrium (r² = 0.975) with rs1004638. The AATA haplotype is more frequent in controls (30.5%) than in cases (27.3%). Similarly, we analyze the haploblock reported in an association study of the Korean population [220]. Only two SNPs, rs2383206 and rs10757278, are also in HapMap 2. For these SNPs, only haplotype AA includes the protective A allele at rs1333047, and this haplotype is protective in the Korean population, with a frequency of 52.1% in controls and 45.1% in cases. Therefore previous results in populations of Asian ancestry are also compatible with a potential functional role of rs1333047.

The implications of this potential functional association are significant. A recent meta-analysis of 7 independent studies with a total of 9,487 cases and 30,171 controls [190] identifies rs1333049 as the strongest association with coronary artery disease (P-value 7.12 · 10^-58, odds ratio 1.27). The original study that associates rs10757278 with coronary artery disease in the Icelandic population and three populations in the United States [214] shows that the odds ratio for heterozygous carriers of the risk allele is 1.26, and the odds ratio for the homozygous carriers is 1.64, and that this association alone might explain up to 21% of the population attributable risk. This study did not specifically analyze rs1333047.
Since all three SNPs are part of the same haplotype in the CEU population, and are perfectly correlated, the odds ratios and attributable risks would be similar for rs1333047. The differences in LD structure, however, mean that both rs10757278 and rs1333049 are poor proxies for rs1333047 in other populations. This has important implications in personalized medicine, as testing those SNPs would lead to incorrect risk predictions if rs1333047 is indeed the mutation that plays a functional role in the phenotype. Furthermore, a potential functional role for rs1333047 would mean that the Type-I-Interferon signaling pathway plays a role in the association at 9p21, in addition to the Type-II-Interferon pathway previously identified by Harismendy et al. Finally, it is important to note that in YRI the frequency of the protective allele is very low (0.8%), meaning that most individuals in populations of African descent might be at a higher risk for coronary artery disease if rs1333047 is the mutation that plays a functional role in the disease. Interestingly, given the low minor allele frequency at rs1333047 in the YRI population, even genotyping this locus would require a much larger population in order to reach statistical significance in this population.

This example illustrates the power of combining the results of functional studies such as ENCODE with results from GWAS in multiple populations, and in particular of considering populations in which replication of an association was unsuccessful together with linkage disequilibrium data. Further experimental validation of the binding of ISGF3 to this region, of the effect of rs1333047 on this binding site, and of the association between rs1333047 and coronary artery disease across populations is, however, necessary to definitively prove that rs1333047 is a functional SNP linked to coronary artery disease. Given the evidence supporting a functional role for multiple loci in tight linkage disequilibrium in 9p21, it appears likely that multiple SNPs in binding sites for transcription factors that are part of a broad range of pathways together play a role in the biological process underlying the association of this region with coronary artery disease.
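To make the orders of magnitude quoted in this discussion concrete, the following Python sketch evaluates the multi-category form of Levin's formula for the population attributable risk. It assumes Hardy-Weinberg genotype proportions with a risk allele frequency of 0.5 (consistent with the 25% of homozygous carriers mentioned above), and it treats the reported odds ratios of 1.26 and 1.64 as relative risks, which is only an approximation. The function names are illustrative, and this is not the computation performed in [214].

```python
def hardy_weinberg(risk_allele_freq):
    """Genotype frequencies (0, 1, 2 risk alleles) under Hardy-Weinberg."""
    q = risk_allele_freq
    return ((1 - q) ** 2, 2 * q * (1 - q), q ** 2)


def population_attributable_risk(genotype_freqs, relative_risks):
    """Levin's formula generalized to several exposure categories.

    PAR = sum_i p_i (RR_i - 1) / (1 + sum_i p_i (RR_i - 1)),
    where the reference category has RR = 1.
    """
    excess = sum(p * (rr - 1.0) for p, rr in zip(genotype_freqs, relative_risks))
    return excess / (1.0 + excess)


# Assumption: risk allele frequency 0.5; odds ratios from [214] used as
# stand-ins for relative risks (non-carrier, heterozygous, homozygous).
genotype_freqs = hardy_weinberg(0.5)
par = population_attributable_risk(genotype_freqs, (1.0, 1.26, 1.64))
print(f"Approximate population attributable risk: {par:.1%}")
```

Under these simplifying assumptions the sketch yields a value of roughly 22%, which is of the same order as the 21% reported in [214]; the exact figure depends on the allele frequency and on how well the odds ratios approximate relative risks.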

6.5 Methods

We use Haploview [233] to analyze linkage disequilibrium data and haplotype frequencies in individual regions. We obtain transcription factor binding motifs from Transfac (STAT1, NFAT) and Jaspar (ISGF3). Motif representations in Figure 6.1 and Figure 6.3 were created using WebLogo 3 [234]. Figures include graphical elements generated using the UCSC genome browser [30].

Bibliography

[1] E.S. Lander, L.M. Linton, B. Birren, C. Nusbaum, M.C. Zody, J. Baldwin, K. Devon, K. Dewar, M. Doyle, W. FitzHugh, et al. Initial sequencing and analysis of the human genome. Nature, 409:860–921, 2001.

[2] J.C. Venter, M.D. Adams, E.W. Myers, P.W. Li, R.J. Mural, G.G. Sutton, H.O. Smith, M. Yandell, C.A. Evans, R.A. Holt, et al. The sequence of the human genome. Science, 291:1304–51, 2001.

[3] E.S. Lander. Initial impact of the sequencing of the human genome. Nature, 470:187–97, 2011.

[4] J.G. Taylor, E.H. Choi, C.B. Foster, and S.J. Chanock. Using genetic variation to study human disease. Trends in Molecular Medicine, 7:507–12, 2001.

[5] R. Sachidanandam, D. Weissman, S.C. Schmidt, J.M. Kakol, L.D. Stein, G. Marth, S. Sherry, J.C. Mullikin, B.J. Mortimore, D.L. Willey, et al. A map of human genome sequence variation containing 1.42 million single nucleotide polymorphisms. Nature, 409:928–33, 2001.

[6] The International HapMap Consortium. A haplotype map of the human genome. Nature, 437:1299–320, 2005.

[7] The International HapMap Consortium. A second generation human haplotype map of over 3.1 million SNPs. Nature, 449:851–61, 2007.

[8] D.G. Wang, J.B. Fan, C.J. Siao, A. Berno, P. Young, R. Sapolsky, G. Ghandour, N. Perkins, E. Winchester, J. Spencer, et al. Large-scale identification, mapping, and genotyping of single-nucleotide polymorphisms in the human genome. Science, 280:1077–82, 1998.

[9] N. Risch and K. Merikangas. The future of genetic studies of complex human diseases. Science, 273:1516–7, 1996.

[10] The Wellcome Trust Case Control Consortium. Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 447(7145):661–678, Jun 2007.

[11] L.A. Hindorff, P. Sethupathy, H.A. Junkins, E.M. Ramos, J.P. Mehta, F.S. Collins, and T.A. Manolio. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proceedings of the National Academy of Sciences of the United States of America, 106:9362–7, 2009.

[12] T.A. Manolio. Genomewide association studies and assessment of the risk of disease. The New England Journal of Medicine, 363:166–76, 2010.

[13] C.E. Jaquish. The Framingham Heart Study, on its way to becoming the gold standard for Cardiovascular Genetic Epidemiology?. BMC Medical Genetics, 8:63, 2007.

[14] J.N. Hirschhorn and M.J. Daly. Genome-wide association studies for common diseases and complex traits. Nature Reviews. Genetics, 6:95–108, 2005.

[15] D.J. Balding. A tutorial on statistical methods for population association studies. Nature Reviews. Genetics, 7:781–91, 2006.

[16] C.E. Bonferroni. Il calcolo delle assicurazioni su gruppi di teste. In Studi in Onore del Professore Salvatore Ortu Carboni, pages 13–60. 1935.

[17] D.E. Reich, M. Cargill, S. Bolk, J. Ireland, P.C. Sabeti, D.J. Richter, T. Lavery, R. Kouyoumjian, S.F. Farhadian, R. Ward, et al. Linkage disequilibrium in the human genome. Nature, 411:199–204, 2001.

[18] R.C. Lewontin and K. Kojima. The evolutionary dynamics of complex polymorphisms. Evolution, 14:458–472, 1960.

[19] P.W. Hedrick. Gametic disequilibrium measures: proceed with caution. Genetics, 117:331–41, 1987.

[20] J. Marchini, B. Howie, S. Myers, G. McVean, and P. Donnelly. A new multipoint method for genome-wide association studies by imputation of genotypes. Nature Genetics, 39:906–13, 2007.

[21] P. Donnelly. Progress and challenges in genome-wide association studies in humans. Nature, 456:728–31, 2008.

[22] J.P. Ioannidis, G. Thomas, and M.J. Daly. Validating, augmenting and refining genome-wide association signals. Nature Reviews. Genetics, 10:318–29, 2009.

[23] C. Libioulle, E. Louis, S. Hansoul, C. Sandor, F. Farnir, D. Franchimont, S. Vermeire, O. Dewit, M. de Vos, A. Dixon, et al. Novel Crohn disease locus identified by genome-wide association maps to a gene desert on 5p13.1 and modulates expression of PTGER4. PLoS Genetics, 3:e58, 2007.

[24] M. Ghoussaini, H. Song, T. Koessler, A.A. Al Olama, Z. Kote-Jarai, K.E. Driver, K.A. Pooley, S.J. Ramus, S.K. Kjaer, E. Hogdall, et al. Multiple loci with different cancer specificities within the 8q24 gene desert. Journal of the National Cancer Institute, 100:962–6, 2008.

[25] T.A. Manolio, F.S. Collins, N.J. Cox, D.B. Goldstein, L.A. Hindorff, D.J. Hunter, M.I. McCarthy, E.M. Ramos, L.R. Cardon, A. Chakravarti, et al. Finding the missing heritability of complex diseases. Nature, 461:747–53, 2009.

[26] S.J. Chanock, T. Manolio, M. Boehnke, E. Boerwinkle, D.J. Hunter, G. Thomas, J.N. Hirschhorn, G. Abecasis, D. Altshuler, J.E. Bailey-Wilson, et al. Replicating genotype-phenotype associations. Nature, 447:655–60, 2007.

[27] K. Miclaus, M. Chierici, C. Lambert, L. Zhang, S. Vega, H. Hong, S. Yin, C. Furlanello, R. Wolfinger, and F. Goodsaid. Variability in GWAS analysis: the impact of genotype calling algorithm inconsistencies. The Pharmacogenomics Journal, 10:324–35, 2010.

[28] J. Marchini, C. Spencer, Y. Teo, and P. Donnelly. A Bayesian hierarchical mixture model for genotype calling in a multi-cohort study. (in preparation).

[29] J.M. Korn, F.G. Kuruvilla, S.A. McCarroll, A. Wysoker, J. Nemesh, S. Cawley, E. Hubbell, J. Veitch, P.J. Collins, K. Darvishi, et al. Integrated genotype calling and association analysis of SNPs, common copy number polymorphisms and rare CNVs. Nature Genetics, 40:1253–60, 2008.

[30] W.J. Kent, C.W. Sugnet, T.S. Furey, K.M. Roskin, T.H. Pringle, A.M. Zahler, and D. Haussler. The human genome browser at UCSC. Genome Research, 12:996–1006, 2002.

[31] W.J. Kent, R. Baertsch, A. Hinrichs, W. Miller, and D. Haussler. Evolution’s cauldron: duplication, deletion, and rearrangement in the mouse and human genomes. Proceedings of the National Academy of Sciences of the United States of America, 100:11484–9, 2003.

[32] D.R. Rhodes, J. Yu, K. Shanker, N. Deshpande, R. Varambally, D. Ghosh, T. Barrette, A. Pandey, and A.M. Chinnaiyan. Large-scale meta-analysis of cancer microarray data identifies common transcriptional profiles of neoplastic transformation and progression. Proceedings of the National Academy of Sciences of the United States of America, 101(25):9309–9314, Jun 2004.

[33] K.I. Goh, M.E. Cusick, D. Valle, B. Childs, M. Vidal, and A.L. Barabási. The human disease network. Proceedings of the National Academy of Sciences of the United States of America, 104(21):8685–8690, May 2007.

[34] E. Zeggini, L.J. Scott, R. Saxena, B.F. Voight, J.L. Marchini, T. Hu, P.I. de Bakker, G.R. Abecasis, P. Almgren, G. Andersen, et al. Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet, 40(5):638–645, May 2008.

[35] R. Chen, A.A. Morgan, J. Dudley, T. Deshpande, L. Li, K. Kodama, A.P. Chiang, and A.J. Butte. Fitsnps: highly differentially expressed genes are more likely to have variants associated with disease. Genome Biology, 9(12), Dec 2008.

[36] E.Y. Fung, D.J. Smyth, J.M. Howson, J.D. Cooper, N.M. Walker, H. Stevens, L.S. Wicker, and J.A. Todd. Analysis of 17 autoimmune disease-associated variants in type 1 diabetes identifies 6q23/tnfaip3 as a susceptibility locus. Genes Immun, 10(2):188–191, Mar 2009.

[37] A. Torkamani, E.J. Topol, and N.J. Schork. Pathway analysis of seven common diseases assessed by genome-wide association. Genomics, 92(5):265–272, Nov 2008.

[38] J.B. Meigs, P. Shrader, L.M. Sullivan, J.B. McAteer, C.S. Fox, J. Dupuis, A.K. Manning, J.C. Florez, P.W.F. Wilson, R.B. D’Agostino Sr, et al. Genotype Score in Addition to Common Risk Factors for Prediction of Type 2 Diabetes. The New England Journal of Medicine, 359(21):2208, 2008.

[39] C.P. Torfs, M.C. King, B. Huey, J. Malmgren, and F.C. Grumet. Genetic interrelationship between insulin-dependent diabetes mellitus, the autoimmune thyroid diseases, and rheumatoid arthritis. American Journal of Human Genetics, 38(2):170, 1986.

[40] J.P. Lin, J.M. Cash, S.Z. Doyle, S. Peden, K. Kanik, C.I. Amos, S.J. Bale, and R.L. Wilder. Familial clustering of rheumatoid arthritis with other autoimmune diseases. Human Genetics, 103(4):475–482, 1998.

[41] S. Nejentsev, J.M.M. Howson, N.M. Walker, J. Szeszko, S.F. Field, H.E. Stevens, P. Reynolds, M. Hardy, E. King, J. Masters, et al. Localization of type 1 diabetes susceptibility to the MHC class I genes HLA-B and HLA-A. Nature, 450(7171):887, 2007.

[42] L. Johannessen, U. Strudsholm, L. Foldager, and P. Munk-Jørgensen. Increased risk of hypertension in patients with bipolar disorder and patients with anxiety compared to background population and patients with schizophrenia. Journal of Affective Disorders, 95(1-3):13–17, 2006.

[43] Y.I. Liu, P.H. Wise, and A.J. Butte. The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics, 10 Suppl 2:S14, 2009.

[44] M. Sirota, M.A. Schaub, S. Batzoglou, W.H. Robinson, and A.J. Butte. Autoimmune disease classification by inverse association with SNP alleles. PLoS Genetics, 5:e1000792, doi:10.1371/journal.pgen.1000792, 2009.

[45] L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone. Classification and Regression Trees. Wadsworth. Belmont, CA, 1984.

[46] J.R. Quinlan. Simplifying Decision Trees. AI Memo 930, 1986.

[47] A.H. Sinclair, P. Berta, M.S. Palmer, J.R. Hawkins, B.L. Griffiths, M.J. Smith, J.W. Foster, A.M. Frischauf, R. Lovell-Badge, and P.N. Goodfellow. A gene from the human sex-determining region encodes a protein with homology to a conserved DNA-binding motif. Nature, 346:240–4, 1990.

[48] B.T. Lahn and D.C. Page. Four evolutionary strata on the human X chromosome. Science, 286:964–7, 1999.

[49] M.C. Simmler, F. Rouyer, G. Vergnaud, M. Nystrom-Lahti, K.Y. Ngo, A. de la Chapelle, and J. Weissenbach. Pseudoautosomal DNA sequences in the pairing region of the human sex chromosomes. Nature, 317:692–7, 1985.

[50] D. Freije, C. Helms, M.S. Watson, and H. Donis-Keller. Identification of a second pseudoautosomal region near the Xq and Yq telomeres. Science, 258:1784–7, 1992.

[51] K. Kvaloy, F. Galvagni, and W.R. Brown. The sequence organization of the long arm pseudoautosomal region of the human sex chromosomes. Human Molecular Genetics, 3:771–8, 1994.

[52] P.L. Pearson and M. Bobrow. Definitive evidence for the short arm of the Y chromosome associating with the X chromosome during meiosis in the human male. Nature, 226:959–61, 1970.

[53] P.E. Polani. Pairing of X and Y chromosomes, non-inactivation of X-linked genes, and the maleness factor. Human Genetics, 60:207–11, 1982.

[54] P.S. Burgoyne. Genetic homology and crossing over in the X and Y chromosomes of mammals. Human Genetics, 61:85–90, 1982.

[55] L. Kauppi, M. Barchi, F. Baudat, P.J. Romanienko, S. Keeney, and M. Jasin. Distinct properties of the XY pseudoautosomal region crucial for male meiosis. Science, 331:916–20, 2011.

[56] N. Ellis, A. Taylor, B.O. Bengtsson, J. Kidd, J. Rogers, and P. Goodfellow. Population structure of the human pseudoautosomal boundary. Nature, 344:663–5, 1990.

[57] N.A. Ellis, P.J. Goodfellow, B. Pym, M. Smith, M. Palmer, A.M. Frischauf, and P.N. Goodfellow. The pseudoautosomal boundary in man is defined by an Alu repeat sequence inserted on the Y chromosome. Nature, 337:81–4, 1989.

[58] N. Ellis, P. Yen, K. Neiswanger, L.J. Shapiro, and P.N. Goodfellow. Evolution of the pseudoautosomal boundary in Old World monkeys and great apes. Cell, 63:977–86, 1990.

[59] C. Mondello, H.H. Ropers, I.W. Craig, E. Tolley, and P.N. Goodfellow. Physical mapping of genes and sequences at the end of the human X chromosome short arm. Annals of human genetics, 51:137–43, 1987.

[60] G.A. Rappold. The pseudoautosomal regions of the human sex chromosomes. Human Genetics, 92:315–24, 1993.

[61] C. Mondello, P.J. Goodfellow, and P.N. Goodfellow. Analysis of methylation of a human X located gene which escapes X inactivation. Nucleic Acids Research, 16:6813–24, 1988.

[62] J.D. Mann, A. Cahan, A.G. Gelb, N. Fisher, J. Hamper, P. Tippett, R. Sanger, and R.R. Race. A sex-linked blood group. Lancet, 1:8–10, 1962.

[63] P.J. Goodfellow, C. Pritchard, P. Tippett, and P.N. Goodfellow. Recombination between the X and Y chromosomes: implications for the relationship between MIC2, XG and YG. Annals of Human Genetics, 51:161–7, 1987.

[64] N.A. Ellis, T.Z. Ye, S. Patton, J. German, P.N. Goodfellow, and P. Weller. Cloning of PBDX, an MIC2-related gene that spans the pseudoautosomal boundary on chromosome Xp. Nature Genetics, 6:394–400, 1994.

[65] N.A. Ellis, P. Tippett, A. Petty, M. Reid, P.A. Weller, T.Z. Ye, J. German, P.N. Goodfellow, S. Thomas, and G. Banting. PBDX is the XG blood group gene. Nature Genetics, 8:285–90, 1994.

[66] P.A. Weller, R. Critcher, P.N. Goodfellow, J. German, and N.A. Ellis. The human Y chromosome homologue of XG: transcription of a naturally truncated gene. Human Molecular Genetics, 4:859–68, 1995.

[67] The 1000 Genomes Project Consortium. A map of human genome variation from population-scale sequencing. Nature, 467:1061–73, 2010.

[68] R.A. Fisher. Statistical Methods for Research Workers. Oliver and Boyd, Edinburgh, 1925.

[69] G.H. Hardy. Mendelian proportions in a mixed population. Science, 28:49–50, 1908.

[70] W. Weinberg. Über den Nachweis der Vererbung beim Menschen. Jahreshefte des Vereins für vaterländische Naturkunde in Württemberg, 64:369–382, 1908.

[71] B.L. Browning and S.R. Browning. A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals. American Journal of Human Genetics, 84:210–23, 2009.

[72] B.O. Bengtsson and P.N. Goodfellow. The effect of recombination between the X and Y chromosomes of mammals. Annals of Human Genetics, 51:57–64, 1987.

[73] A.G. Clark. The evolution of the Y chromosome with X-Y recombination. Genetics, 119:711–20, 1988.

[74] A. Flaquer, G.A. Rappold, T.F. Wienker, and C. Fischer. The human pseudoautosomal regions: a review for genetic epidemiologists. European Journal of Human Genetics, 16:771–9, 2008.

[75] H.J. Cooke, W.R. Brown, and G.A. Rappold. Hypervariable telomeric sequences from the human sex chromosomes are pseudoautosomal. Nature, 317:687–92, 1985.

[76] F. Rouyer, M.C. Simmler, C. Johnsson, G. Vergnaud, H.J. Cooke, and J. Weissenbach. A gradient of sex linkage in the pseudoautosomal region of the human sex chromosomes. Nature, 319:291–5, 1986.

[77] P.J. Goodfellow, S.M. Darling, N.S. Thomas, and P.N. Goodfellow. A pseudoautosomal gene in man. Science, 234:740–3, 1986.

[78] D.C. Page, K. Bieker, L.G. Brown, S. Hinton, M. Leppert, J.M. Lalouel, M. Lathrop, M. Nystrom-Lahti, A. de la Chapelle, and R. White. Linkage, physical mapping, and DNA sequence analysis of pseudoautosomal loci on the human X and Y chromosomes. Genomics, 1:243–56, 1987.

[79] J.F. Hughes, H. Skaletsky, T. Pyntikova, T.A. Graves, S.K. van Daalen, P.J. Minx, R.S. Fulton, S.D. McGrath, D.P. Locke, C. Friedman, et al. Chimpanzee and human Y chromosomes are remarkably divergent in structure and gene content. Nature, 463:536–9, 2010.

[80] J.F. Hughes, H. Skaletsky, L.G. Brown, T. Pyntikova, T. Graves, R.S. Fulton, S. Dugan, Y. Ding, C.J. Buhay, C. Kremitzki, et al. Strict evolutionary conservation followed rapid gene loss on human and rhesus Y chromosomes. Nature, 483:82–6, 2012.

[81] L.S. Whitfield, R. Lovell-Badge, and P.N. Goodfellow. Rapid sequence evolution of the mammalian sex-determining gene SRY. Nature, 364:713–5, 1993.

[82] R.M. Cox and R. Calsbeek. Sexually antagonistic selection, sexual dimorphism, and the resolution of intralocus sexual conflict. The American Naturalist, 173:176–87, 2009.

[83] S.P. Otto, J.R. Pannell, C.L. Peichel, T.L. Ashman, D. Charlesworth, A.K. Chippindale, L.F. Delph, R.F. Guerrero, S.V. Scarpino, and B.F. McAllister. About PAR: the distinct evolutionary dynamics of the pseudoautosomal region. Trends in Genetics, 27:358–67, 2011.

[84] N.H. Barton. Genetic hitchhiking. Philosophical Transactions of the Royal Society of London. Series B, Biological Sciences, 355:1553–62, 2000.

[85] F.J. Charchar, M. Svartman, N. El-Mogharbel, M. Ventura, P. Kirby, M.R. Matarazzo, A. Ciccodicola, M. Rocchi, M. D’Esposito, and J.A. Graves. Complex events in the evolution of the human pseudoautosomal region 2 (PAR2). Genome Research, 13:281–6, 2003.

[86] S. Sarbajna, M. Denniff, A.J. Jeffreys, R. Neumann, M. Soler Artigas, A. Veselis, and C.A. May. A major recombination hotspot in the XqYq pseudoautosomal region gives new insight into processing of human gene conversion events. Human Molecular Genetics, 21:2029–38, 2012.

[87] Z.H. Rosser, P. Balaresque, and M.A. Jobling. Gene conversion between the X chromosome and the male-specific region of the Y chromosome at a translocation hotspot. American Journal of Human Genetics, 85:130–4, 2009.

[88] R.P. Meisel, J.H. Malone, and A.G. Clark. Disentangling the relationship between sex-biased gene expression and X-linkage. Genome Research, 2012.

[89] L.Y. Liu, M.A. Schaub, M. Sirota, and A.J. Butte. Sex differences in disease risk from reported genome-wide association study findings. Human Genetics, 131:353–64, 2012.

[90] L.Y. Liu, M.A. Schaub, M. Sirota, and A.J. Butte. Transmission distortion in Crohn’s disease risk gene ATG16L1 leads to sex difference in disease association. Inflammatory Bowel Diseases, 18:312–22, 2012.

[91] A.D. Paterson, D. Waggott, A. Schillert, C. Infante-Rivard, S.B. Bull, Y.J. Yoo, and D. Pinnaduwage. Transmission-ratio distortion in the Framingham Heart Study. BMC Proceedings, 3 Suppl 7:S51, 2009.

[92] H. Lango Allen, K. Estrada, G. Lettre, S.I. Berndt, M.N. Weedon, F. Rivadeneira, C.J. Willer, A.U. Jackson, S. Vedantam, S. Raychaudhuri, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467:832–8, 2010.

[93] S. Sanna, B. Li, A. Mulas, C. Sidore, H.M. Kang, A.U. Jackson, M.G. Piras, G. Usala, G. Maninchedda, A. Sassu, et al. Fine mapping of five loci associated with low-density lipoprotein cholesterol detects variants that double the explained heritability. PLoS Genetics, 7:e1002198, doi:10.1371/journal.pgen.1002198, 2011.

[94] P.C. Ng and S. Henikoff. SIFT: Predicting changes that affect protein function. Nucleic Acids Research, 31:3812–4, 2003.

[95] I.A. Adzhubei, S. Schmidt, L. Peshkin, V.E. Ramensky, A. Gerasimova, P. Bork, A.S. Kondrashov, and S.R. Sunyaev. A method and server for predicting damaging missense mutations. Nature Methods, 7:248–249, 2010.

[96] S.F. Saccone, R. Bolze, P. Thomas, J. Quan, G. Mehta, E. Deelman, J.A. Tischfield, and J.P. Rice. SPOT: a web-based tool for using biological databases to prioritize SNPs after a genome-wide association study. Nucleic Acids Research, 38:W201–9, 2010.

[97] B.E. Stranger, A.C. Nica, M.S. Forrest, A. Dimas, C.P. Bird, C. Beazley, C.E. Ingle, M. Dunning, P. Flicek, D. Koller, et al. Population genomics of human gene expression. Nature Genetics, 39:1217–1224, 2007.

[98] E.E. Schadt, C. Molony, E. Chudin, K. Hao, X. Yang, P.Y. Lum, A. Kasarskis, B. Zhang, S. Wang, C. Suver, et al. Mapping the genetic architecture of gene expression in human liver. PLoS Biology, 6:e107, 2008.

[99] H. Zhong, J. Beaulaurier, P.Y. Lum, C. Molony, X. Yang, D.J. Macneil, D.T. Weingarth, B. Zhang, D. Greenawalt, R. Dobrin, et al. Liver and adipose expression associated SNPs are enriched for association to type 2 diabetes. PLoS Genetics, 6:e1000932, doi:10.1371/journal.pgen.1000932, 2010.

[100] D.L. Nicolae, E. Gamazon, W. Zhang, S. Duan, M.E. Dolan, and N.J. Cox. Trait-associated SNPs are more likely to be eQTLs: annotation to enhance discovery from GWAS. PLoS Genetics, 6:e1000888, doi:10.1371/journal.pgen.1000888, 2010.

[101] Z. Xu and J.A. Taylor. SNPinfo: integrating GWAS and candidate gene information into functional SNP selection for genetic association studies. Nucleic Acids Research, 37:W600–5, 2009.

[102] G. Macintyre, J. Bailey, I. Haviv, and A. Kowalczyk. is-rSNP: a novel technique for in silico regulatory SNP detection. Bioinformatics, 26:i524–30, 2010.

[103] J.E. Landers, J. Melki, V. Meininger, J.D. Glass, L.H. van den Berg, M.A. van Es, P.C. Sapp, P.W. van Vught, D.M. McKenna-Yasek, H.M. Blauw, et al. Reduced expression of the Kinesin-Associated Protein 3 (KIFAP3) gene increases survival in sporadic amyotrophic lateral sclerosis. Proceedings of the National Academy of Sciences of the United States of America, 106:9004–9, 2009.

[104] O. Jarinova, A.F. Stewart, R. Roberts, G. Wells, P. Lau, T. Naing, C. Buerki, B.W. McLean, R.C. Cook, J.S. Parker, et al. Functional analysis of the chromosome 9p21.3 coronary artery disease risk locus. Arteriosclerosis, Thrombosis, and Vascular Biology, 29:1671–7, 2009.

[105] G. Robertson, M. Hirst, M. Bainbridge, M. Bilenky, Y. Zhao, T. Zeng, G. Euskirchen, B. Bernier, R. Varhol, A. Delaney, et al. Genome-wide profiles of STAT1 DNA association using chromatin immunoprecipitation and massively parallel sequencing. Nature Methods, 4:651–7, 2007.

[106] D.S. Johnson, A. Mortazavi, R.M. Myers, and B. Wold. Genome-wide mapping of in vivo protein-DNA interactions. Science, 316:1497–502, 2007.

[107] D.S. Gross and W.T. Garrard. Nuclease hypersensitive sites in chromatin. Annual Review of Biochemistry, 57:159–97, 1988.

[108] G.E. Crawford, I.E. Holt, J. Whittle, B.D. Webb, D. Tai, S. Davis, E.H. Margulies, Y. Chen, J.A. Bernat, D. Ginsburg, et al. Genome-wide mapping of DNase hypersensitive sites using massively parallel signature sequencing (MPSS). Genome Research, 16:123–31, 2006.

[109] A.P. Boyle, S. Davis, H.P. Shulha, P. Meltzer, E.H. Margulies, Z. Weng, T.S. Furey, and G.E. Crawford. High-resolution mapping and characterization of open chromatin across the genome. Cell, 132:311–22, 2008.

[110] M. Kasowski, F. Grubert, C. Heffelfinger, M. Hariharan, A. Asabere, S.M. Waszak, L. Habegger, J. Rozowsky, M. Shi, A.E. Urban, et al. Variation in transcription factor binding among humans. Science, 328:232–5, 2010.

[111] H. Lou, M. Yeager, H. Li, J.G. Bosquet, R.B. Hayes, N. Orr, K. Yu, A. Hutchinson, K.B. Jacobs, P. Kraft, et al. Fine mapping and functional analysis of a common variant in MSMB on chromosome 10q11.2 associated with prostate cancer susceptibility. Proceedings of the National Academy of Sciences of the United States of America, 106:7933–8, 2009.

[112] L.G. Carvajal-Carmona, J.B. Cazier, A.M. Jones, K. Howarth, P. Broderick, A. Pittman, S. Dobbins, A. Tenesa, S. Farrington, J. Prendergast, et al. Fine-mapping of colorectal cancer susceptibility loci at 8q23.3, 16q22.1 and 19q13.11: refinement of association signals and use of in silico analysis to suggest functional variation and unexpected candidate target genes. Human Molecular Genetics, 20:2879–88, 2011.

[113] D.S. Paul, J.P. Nisbet, T.P. Yang, S. Meacham, A. Rendon, K. Hautaviita, J. Tallila, J. White, M.R. Tijssen, S. Sivapalaratnam, et al. Maps of open chromatin guide the functional follow-up of genome-wide association signals: application to hematological traits. PLoS Genetics, 7:e1002139, doi:10.1371/journal.pgen.1002139, 2011.

[114] O. Harismendy, D. Notani, X. Song, N.G. Rahim, B. Tanasa, N. Heintzman, B. Ren, X.D. Fu, E.J. Topol, M.G. Rosenfeld, et al. 9p21 DNA variants associated with coronary artery disease impair interferon-gamma signalling response. Nature, 470:264–8, 2011.

[115] J. Ernst, P. Kheradpour, T.S. Mikkelsen, N. Shoresh, L.D. Ward, C.B. Epstein, X. Zhang, L. Wang, R. Issner, M. Coyne, et al. Mapping and analysis of chromatin state dynamics in nine human cell types. Nature, 473:43–9, 2011.

[116] The ENCODE Project Consortium. The ENCODE (ENCyclopedia Of DNA Elements) Project. Science, 306:636–40, 2004.

[117] The ENCODE Project Consortium. Identification and analysis of functional elements in 1% of the human genome by the ENCODE pilot project. Nature, 447:799–816, 2007.

[118] The ENCODE Project Consortium. A user’s guide to the encyclopedia of DNA elements (ENCODE). PLoS Biology, 9:e1001046 10.1371/journal.pbio.1001046 [doi], 2011.

[119] The ENCODE Project Consortium. An Integrated Encyclopedia of DNA Elements in the Human Genome. Nature, 2012 (in press).

[120] J.R. Hesselberth, X. Chen, Z. Zhang, P.J. Sabo, R. Sandstrom, A.P. Reynolds, R.E. Thurman, S. Neph, M.S. Kuehn, W.S. Noble, et al. Global mapping of protein-DNA interactions in vivo by digital genomic footprinting. Nature Methods, 6:283–9, 2009.

[121] A.P. Boyle, L. Song, B.K. Lee, D. London, D. Keefe, E. Birney, V.R. Iyer, G.E. Crawford, and T.S. Furey. High-resolution genome-wide in vivo footprinting of diverse transcription factors in human cells. Genome Research, 21:456–64, 2011.

[122] R. Pique-Regi, J.F. Degner, A.A. Pai, D.J. Gaffney, Y. Gilad, and J.K. Pritchard. Accurate inference of transcription factor binding from DNA sequence and chromatin accessibility data. Genome Research, 21:447–55, 2011.

[123] A.P. Boyle, E.L. Hong, M. Hariharan, Y. Cheng, M.A. Schaub, M. Kasowski, K.J. Karczewski, J. Park, B.C. Hitz, S. Weng, et al. Annotation of Functional Variation in Personal Genomes Using RegulomeDB. Genome Research, 2012 (in press).

[124] G.H. Wei, G. Badis, M.F. Berger, T. Kivioja, K. Palin, M. Enge, M. Bonke, A. Jolma, M. Varjosalo, A.R. Gehrke, et al. Genome-wide analysis of ETS-family DNA-binding in vitro and in vivo. The EMBO Journal, 29:2147–60, 2010.

[125] M.N. Weedon, H. Lango, C.M. Lindgren, C. Wallace, D.M. Evans, M. Mangino, R.M. Freathy, J.R. Perry, S. Stevens, A.S. Hall, et al. Genome-wide association analysis identifies 20 loci that influence adult height. Nature Genetics, 40:575–83, 2008.

[126] D.F. Gudbjartsson, G.B. Walters, G. Thorleifsson, H. Stefansson, B.V. Halldorsson, P. Zusmanovich, P. Sulem, S. Thorlacius, A. Gylfason, S. Steinberg, et al. Many sequence variants affecting diversity of adult human height. Nature Genetics, 40:609–15, 2008.

[127] Y. Okada, Y. Kamatani, A. Takahashi, K. Matsuda, N. Hosono, H. Ohmiya, Y. Daigo, K. Yamamoto, M. Kubo, Y. Nakamura, et al. A genome-wide association study in 19 633 Japanese subjects identified LHX3-QSOX2 and IGF1 as adult height loci. Human Molecular Genetics, 19:2303–12, 2010.

[128] G. Lettre, A.U. Jackson, C. Gieger, F.R. Schumacher, S.I. Berndt, S. Sanna, S. Eyheramendy, B.F. Voight, J.L. Butler, C. Guiducci, et al. Identification of ten loci associated with height highlights new biological pathways in human growth. Nature Genetics, 40:584–91, 2008.

[129] J.J. Kim, H.I. Lee, T. Park, K. Kim, J.E. Lee, N.H. Cho, C. Shin, Y.S. Cho, J.Y. Lee, B.G. Han, et al. Identification of 15 loci influencing height in a Korean population. Journal of human Genetics, 55:27–31, 2010.

[130] R.A. Eeles, Z. Kote-Jarai, A.A. Al Olama, G.G. Giles, M. Guy, G. Severi, K. Muir, J.L. Hopper, B.E. Henderson, C.A. Haiman, et al. Identification of seven new prostate cancer susceptibility loci through a genome-wide association study. Nature Genetics, 41:1116–21, 2009.

[131] R. Takata, S. Akamatsu, M. Kubo, A. Takahashi, N. Hosono, T. Kawaguchi, T. Tsunoda, J. Inazawa, N. Kamatani, O. Ogawa, et al. Genome-wide association study identifies five new susceptibility loci for prostate cancer in the Japanese population. Nature Genetics, 42:751–4, 2010.

[132] G. Thomas, K.B. Jacobs, M. Yeager, P. Kraft, S. Wacholder, N. Orr, K. Yu, N. Chatterjee, R. Welch, A. Hutchinson, et al. Multiple loci identified in a genome-wide association study of prostate cancer. Nature Genetics, 40:310–5, 2008.

[133] F.R. Schumacher, S.I. Berndt, A. Siddiq, K.B. Jacobs, Z. Wang, S. Lindstrom, V.L. Stevens, C. Chen, A.M. Mondul, R.C. Travis, et al. Genome-wide association study identifies new prostate cancer susceptibility loci. Human Molecular Genetics, 20:3867–75, 2011.

[134] R.A. Eeles, Z. Kote-Jarai, G.G. Giles, A.A. Olama, M. Guy, S.K. Jugurnauth, S. Mulholland, D.A. Leongamornlert, S.M. Edwards, J. Morrison, et al. Multiple newly identified loci associated with prostate cancer susceptibility. Nature Genetics, 40:316–21, 2008.

[135] J. Gudmundsson, P. Sulem, V. Steinthorsdottir, J.T. Bergthorsson, G. Thorleifsson, A. Manolescu, T. Rafnar, D. Gudbjartsson, B.A. Agnarsson, A. Baker, et al. Two variants on chromosome 17 confer prostate cancer risk, and the one in TCF2 protects against type 2 diabetes. Nature Genetics, 39:977–83, 2007.

[136] P.G. Giresi, J. Kim, R.M. McDaniell, V.R. Iyer, and J.D. Lieb. FAIRE (Formaldehyde-Assisted Isolation of Regulatory Elements) isolates active regulatory elements from human chromatin. Genome Research, 17:877–85, 2007.

[137] C.C. Chung, J. Ciampa, M. Yeager, K.B. Jacobs, S.I. Berndt, R.B. Hayes, J. Gonzalez-Bosquet, P. Kraft, S. Wacholder, N. Orr, et al. Fine mapping of a region of chromosome 11q13 reveals multiple independent loci associated with risk of prostate cancer. Human Molecular Genetics, 20:2869–78, 2011.

[138] G.R. Crabtree and E.N. Olson. NFAT signaling: choreographing the social lives of cells. Cell, 109 Suppl:S67–79, 2002.

[139] J.J. Heit, A.A. Apelqvist, X. Gu, M.M. Winslow, J.R. Neilson, G.R. Crabtree, and S.K. Kim. Calcineurin/NFAT signalling regulates pancreatic beta-cell growth and function. Nature, 443:345–9, 2006.

[140] K. Hashiba. Hereditary QT prolongation syndrome in Japan: genetic analysis and pathological findings of the conducting system. Japanese Circulation Journal, 42:1133–50, 1978.

[141] E.H. Locati, W. Zareba, A.J. Moss, P.J. Schwartz, G.M. Vincent, M.H. Lehmann, J.A. Towbin, S.G. Priori, C. Napolitano, J.L. Robinson, et al. Age- and sex-related differences in clinical manifestations in patients with congenital long-QT syndrome: findings from the International LQTS Registry. Circulation, 97:2237–44, 1998.

[142] M. Nakagawa, T. Ooie, N. Takahashi, Y. Taniguchi, F. Anan, H. Yonemochi, and T. Saikawa. Influence of menstrual cycle on QT interval dynamics. Pacing and Clinical Electrophysiology, 29:607–13, 2006.

[143] A.H. Kadish, P. Greenland, M.C. Limacher, W.H. Frishman, S.A. Daugherty, and J.B. Schwartz. Estrogen and progestin use and the QT interval in postmenopausal women. Annals of Noninvasive Electrocardiology, 9:366–74, 2004.

[144] M. Gokce, B. Karahan, R. Yilmaz, C. Orem, C. Erdol, and S. Ozdemir. Long term effects of hormone replacement therapy on heart rate variability, QT interval, QT dispersion and frequencies of arrhythmia. International Journal of Cardiology, 99:373–9, 2005.

[145] J.G. Sutcliffe, P.B. Hedlund, E.A. Thomas, F.E. Bloom, and B.S. Hilbush. Peripheral reduction of beta-amyloid is sufficient to reduce brain beta-amyloid: implications for Alzheimer’s disease. Journal of Neuroscience Research, 89:808–14, 2011.

[146] G.N. Filippova, S. Fagerlie, E.M. Klenova, C. Myers, Y. Dehner, G. Goodwin, P.E. Neiman, S.J. Collins, and V.V. Lobanenkov. An exceptionally conserved transcriptional repressor, CTCF, employs different combinations of zinc fingers to bind diverged sequences of avian and mammalian c-myc oncogenes. Molecular and Cellular Biology, 16:2802–13, 1996.

[147] A.C. Bell and G. Felsenfeld. Methylation of a CTCF-dependent boundary controls imprinted expression of the Igf2 gene. Nature, 405:482–5, 2000.

[148] T.H. Kim, Z.K. Abdullaev, A.D. Smith, K.A. Ching, D.I. Loukinov, R.D. Green, M.Q. Zhang, V.V. Lobanenkov, and B. Ren. Analysis of the vertebrate insulator protein CTCF-binding sites in the human genome. Cell, 128:1231–45, 2007.

[149] R. Ohlsson, R. Renkawitz, and V. Lobanenkov. CTCF is a uniquely versatile transcription regulator linked to epigenetics and disease. Trends in Genetics, 17:520–7, 2001.

[150] Z.Q. Wang, M.R. Fung, D.P. Barlow, and E.F. Wagner. Regulation of embryonic growth and lysosomal targeting by the imprinted Igf2/Mpr gene. Nature, 372:464–7, 1994.

[151] A.S. Van Laere, M. Nguyen, M. Braunschweig, C. Nezer, C. Collette, L. Moreau, A.L. Archibald, C.S. Haley, N. Buys, M. Tally, et al. A regulatory mutation in IGF2 causes a major QTL effect on muscle growth in the pig. Nature, 425:832–6, 2003.

[152] P.B. Grino, J.E. Griffin, and J.D. Wilson. Testosterone at high concentrations interacts with the human androgen receptor similarly to dihydrotestosterone. Endocrinology, 126:1165–72, 1990.

[153] T. Visakorpi, E. Hyytinen, P. Koivisto, M. Tanner, R. Keinanen, C. Palmberg, A. Palotie, T. Tammela, J. Isola, and O.P. Kallioniemi. In vivo amplification of the androgen receptor gene and progression of human prostate cancer. Nature Genetics, 9:401–6, 1995.

[154] N. Craft, Y. Shostak, M. Carey, and C.L. Sawyers. A mechanism for hormone-independent prostate cancer through modulation of androgen receptor signaling by the HER-2/neu tyrosine kinase. Nature Medicine, 5:280–5, 1999.

[155] N. Sharifi, J.L. Gulley, and W.L. Dahut. Androgen deprivation therapy for prostate cancer. JAMA, 294:238–44, 2005.

[156] E.W. Sayers, T. Barrett, D.A. Benson, E. Bolton, S.H. Bryant, K. Canese, V. Chetvernin, D.M. Church, M. Dicuccio, S. Federhen, et al. Database resources of the National Center for Biotechnology Information. Nucleic Acids Research, 2011.

[157] T. Derrien, R. Johnson, G. Bussotti, A. Tanzer, S. Djebali, H. Tilgner, G. Guernec, D. Martin, A. Merkel, D. Gonzalez, et al. The GENCODE v7 catalogue of human long non-coding RNAs: Analysis of their gene structure, evolution and expression. Genome Research, 2012 (in press).

[158] Q. Yang, A. Kottgen, A. Dehghan, A.V. Smith, N.L. Glazer, M.H. Chen, D.I. Chasman, T. Aspelund, G. Eiriksdottir, T.B. Harris, et al. Multiple genetic loci influence serum urate levels and their relationship with gout and cardiovascular disease risk factors. Circulation. Cardiovascular Genetics, 3:523–30, 2010.

[159] D.P. McGovern, M.R. Jones, K.D. Taylor, K. Marciante, X. Yan, M. Dubinsky, A. Ippoliti, E. Vasiliauskas, D. Berel, C. Derkowski, et al. Fucosyltransferase 2 (FUT2) non-secretor status is associated with Crohn’s disease. Human Molecular Genetics, 19:3468–76, 2010.

[160] J.C. Barrett, S. Hansoul, D.L. Nicolae, J.H. Cho, R.H. Duerr, J.D. Rioux, S.R. Brant, M.S. Silverberg, K.D. Taylor, M.M. Barmada, et al. Genome-wide association defines more than 30 distinct susceptibility loci for Crohn’s disease. Nature Genetics, 40:955–62, 2008.

[161] I.M. Heid, A.U. Jackson, J.C. Randall, T.W. Winkler, L. Qi, V. Steinthorsdottir, G. Thorleifsson, M.C. Zillikens, E.K. Speliotes, R. Magi, et al. Meta-analysis identifies 13 new loci associated with waist-hip ratio and reveals sexual dimorphism in the genetic basis of fat distribution. Nature Genetics, 42:949–60, 2010.

[162] S.K. Ganesh, N.A. Zakai, F.J. van Rooij, N. Soranzo, A.V. Smith, M.A. Nalls, M.H. Chen, A. Kottgen, N.L. Glazer, A. Dehghan, et al. Multiple loci influence erythrocyte phenotypes in the CHARGE Consortium. Nature Genetics, 41:1191–8, 2009.

[163] C. Newton-Cheh, M. Eijgelsheim, K.M. Rice, P.I. de Bakker, X. Yin, K. Estrada, J.C. Bis, K. Marciante, F. Rivadeneira, P.A. Noseworthy, et al. Common variants at ten loci influence QT interval duration in the QTGEN Study. Nature Genetics, 41:399–406, 2009.

[164] A.D. Johnson, L.R. Yanek, M.H. Chen, N. Faraday, M.G. Larson, G. Tofler, S.J. Lin, A.T. Kraja, M.A. Province, Q. Yang, et al. Genome-wide meta-analyses identifies seven loci associated with platelet aggregation in response to agonists. Nature Genetics, 42:608–13, 2010.

[165] D.M. Dick, F. Aliev, R.F. Krueger, A. Edwards, A. Agrawal, M. Lynskey, P. Lin, M. Schuckit, V. Hesselbrock, J. Nurnberger Jr., et al. Genome-wide association study of conduct disorder symptomatology. Molecular Psychiatry, 16:800–8, 2011.

[166] A. Franke, D.P. McGovern, J.C. Barrett, K. Wang, G.L. Radford-Smith, T. Ahmad, C.W. Lees, T. Balschun, J. Lee, R. Roberts, et al. Genome-wide meta-analysis increases to 71 the number of confirmed Crohn’s disease susceptibility loci. Nature Genetics, 42:1118–25, 2010.

[167] J.C. Barrett, D.G. Clayton, P. Concannon, B. Akolkar, J.D. Cooper, H.A. Erlich, C. Julier, G. Morahan, J. Nerup, C. Nierras, et al. Genome-wide association study and meta-analysis find that over 40 loci affect risk of type 1 diabetes. Nature Genetics, 41:703–7, 2009.

[168] D. Melzer, J.R. Perry, D. Hernandez, A.M. Corsi, K. Stevens, I. Rafferty, F. Lauretani, A. Murray, J.R. Gibbs, G. Paolisso, et al. A genome-wide association study identifies protein quantitative trait loci (pQTLs). PLoS Genetics, 4:e1000072, doi:10.1371/journal.pgen.1000072, 2008.

[169] R.S. Houlston, E. Webb, P. Broderick, A.M. Pittman, M.C. Di Bernardo, S. Lubbe, I. Chandler, J. Vijayakrishnan, K. Sullivan, S. Penegar, et al. Meta-analysis of genome-wide association data identifies four new susceptibility loci for colorectal cancer. Nature Genetics, 40:1426–35, 2008.

[170] P. Hollingworth, D. Harold, R. Sims, A. Gerrish, J.C. Lambert, M.M. Carrasquillo, R. Abraham, M.L. Hamshere, J.S. Pahwa, V. Moskvina, et al. Common variants at ABCA7, MS4A6A/MS4A4E, EPHA1, CD33 and CD2AP are associated with Alzheimer’s disease. Nature Genetics, 43:429–35, 2011.

[171] R.J. Klein, C. Zeiss, E.Y. Chew, J.Y. Tsai, R.S. Sackler, C. Haynes, A.K. Henning, J.P. SanGiovanni, S.M. Mane, S.T. Mayne, et al. Complement factor H polymorphism in age-related macular degeneration. Science, 308:385–9, 2005.

[172] P.C. Dubois, G. Trynka, L. Franke, K.A. Hunt, J. Romanos, A. Curtotti, A. Zhernakova, G.A. Heap, R. Adany, A. Aromaa, et al. Multiple common variants for celiac disease influencing immune gene expression. Nature Genetics, 42:295–302, 2010.

[173] A. Dehghan, Q. Yang, A. Peters, S. Basu, J.C. Bis, A.R. Rudnicka, M. Kavousi, M.H. Chen, J. Baumert, G.D. Lowe, et al. Association of novel genetic loci with circulating fibrinogen levels: a genome-wide association study in 6 population-based cohorts. Circulation. Cardiovascular Genetics, 2:125–33, 2009.

[174] P.F. McArdle, A. Parsa, Y.P. Chang, M.R. Weir, J.R. O’Connell, B.D. Mitchell, and A.R. Shuldiner. Association of a common nonsynonymous variant in GLUT9 with serum uric acid levels in Old Order Amish. Arthritis and Rheumatism, 58:2874–81, 2008.

[175] J.D. Reveille, A.M. Sims, P. Danoy, D.M. Evans, P. Leo, J.J. Pointon, R. Jin, X. Zhou, L.A. Bradbury, L.H. Appleton, et al. Genome-wide association study of ankylosing spondylitis identifies non-MHC susceptibility loci. Nature Genetics, 42:123–7, 2010.

[176] X. Sim, R.T. Ong, C. Suo, W.T. Tay, J. Liu, D.P. Ng, M. Boehnke, K.S. Chia, T.Y. Wong, M. Seielstad, et al. Transferability of type 2 diabetes implicated loci in multi-ethnic cohorts from Southeast Asia. PLoS Genetics, 7:e1001363, doi:10.1371/journal.pgen.1001363, 2011.

[177] J.N. Painter, C.A. Anderson, D.R. Nyholt, S. Macgregor, J. Lin, S.H. Lee, A. Lambert, Z.Z. Zhao, F. Roseman, Q. Guo, et al. Genome-wide association study identifies a locus at 7p15.2 associated with endometriosis. Nature Genetics, 43:51–4, 2011.

[178] D.M. Waterworth, S.L. Ricketts, K. Song, L. Chen, J.H. Zhao, S. Ripatti, Y.S. Aulchenko, W. Zhang, X. Yuan, N. Lim, et al. Genetic variants influencing circulating lipid levels and risk of coronary artery disease. Arteriosclerosis, Thrombosis, and Vascular Biology, 30:2264–76, 2010.

[179] S. Kathiresan, O. Melander, C. Guiducci, A. Surti, N.P. Burtt, M.J. Rieder, G.M. Cooper, C. Roos, B.F. Voight, A.S. Havulinna, et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nature Genetics, 40:189–97, 2008.

[180] S. Gretarsdottir, A.F. Baas, G. Thorleifsson, H. Holm, M. den Heijer, J.P. de Vries, S.E. Kranendonk, C.J. Zeebregts, S.M. van Sterkenburg, R.H. Geelkerken, et al. Genome-wide association study identifies a sequence variant within the DAB2IP gene conferring susceptibility to abdominal aortic aneurysm. Nature Genetics, 42:692–7, 2010.

[181] J.W. Han, H.F. Zheng, Y. Cui, L.D. Sun, D.Q. Ye, Z. Hu, J.H. Xu, Z.M. Cai, W. Huang, G.P. Zhao, et al. Genome-wide association study in a Chinese Han population identifies nine new susceptibility loci for systemic lupus erythematosus. Nature Genetics, 41:1234–7, 2009.

[182] G. Lettre, C.D. Palmer, T. Young, K.G. Ejebe, H. Allayee, E.J. Benjamin, F. Bennett, D.W. Bowden, A. Chakravarti, A. Dreisbach, et al. Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project. PLoS Genetics, 7:e1001300, doi:10.1371/journal.pgen.1001300, 2011.

[183] K.A. Hunt, A. Zhernakova, G. Turner, G.A. Heap, L. Franke, M. Bruinenberg, J. Romanos, L.C. Dinesen, A.W. Ryan, D. Panesar, et al. Newly identified genetic risk variants for celiac disease related to the immune response. Nature Genetics, 40:395–402, 2008.

[184] K.S. Wang, X.F. Liu, and N. Aragam. A genome-wide meta-analysis identifies novel loci associated with schizophrenia and bipolar disorder. Schizophrenia Research, 124:192–9, 2010.

[185] R.C. Kaplan, A.K. Petersen, M.H. Chen, A. Teumer, N.L. Glazer, A. Doring, C.S. Lam, N. Friedrich, A. Newman, M. Muller, et al. A genome-wide association study identifies novel loci associated with circulating IGF-I and IGFBP-3. Human Molecular Genetics, 20:1241–51, 2011.

[186] C.E. Elks, J.R. Perry, P. Sulem, D.I. Chasman, N. Franceschini, C. He, K.L. Lunetta, J.A. Visser, E.M. Byrne, D.L. Cousminer, et al. Thirty new loci for age at menarche identified by a meta-analysis of genome-wide association studies. Nature Genetics, 42:1077–85, 2010.

[187] I.J. Kullo, K. Ding, H. Jouni, C.Y. Smith, and C.G. Chute. A genome-wide association study of red blood cell traits using the electronic medical record. PLoS One, 5, 2010.

[188] V. Enciso-Mora, P. Broderick, Y. Ma, R.F. Jarrett, H. Hjalgrim, K. Hemminki, A. van den Berg, B. Olver, A. Lloyd, S.E. Dobbins, et al. A genome-wide association study of Hodgkin’s lymphoma identifies new susceptibility loci at 2p16.1 (REL), 8q24.21 and 10p14 (GATA3). Nature Genetics, 42:1126–30, 2010.

[189] J.L. Stein, X. Hua, S. Lee, A.J. Ho, A.D. Leow, A.W. Toga, A.J. Saykin, L. Shen, T. Foroud, N. Pankratz, et al. Voxelwise genome-wide association study (vGWAS). NeuroImage, 53:1160–74, 2010.

[190] P.S. Wild, T. Zeller, A. Schillert, S. Szymczak, C.R. Sinning, A. Deiseroth, R.B. Schnabel, E. Lubos, T. Keller, M.S. Eleftheriadis, et al. A genome-wide association study identifies LIPA as a susceptibility gene for coronary artery disease. Circulation. Cardiovascular Genetics, 4:403–12, 2011.

[191] N.J. Samani, J. Erdmann, A.S. Hall, C. Hengstenberg, M. Mangino, B. Mayer, R.J. Dixon, T. Meitinger, P. Braund, H.E. Wichmann, et al. Genomewide association analysis of coronary artery disease. The New England Journal of Medicine, 357:443–53, 2007.

[192] H. Schunkert, I.R. Konig, S. Kathiresan, M.P. Reilly, T.L. Assimes, H. Holm, M. Preuss, A.F. Stewart, M. Barbalic, C. Gieger, et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nature Genetics, 43:333–8, 2011.

[193] Coronary Artery Disease (C4D) Genetics Consortium. A genome-wide association study in Europeans and South Asians identifies five new loci for coronary artery disease. Nature Genetics, 43:339–44, 2011.

[194] S. Kathiresan, B.F. Voight, S. Purcell, K. Musunuru, D. Ardissino, P.M. Mannucci, S. Anand, J.C. Engert, N.J. Samani, H. Schunkert, et al. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nature Genetics, 41:334–41, 2009.

[195] S.J. Furney, A. Simmons, G. Breen, I. Pedroso, K. Lunnon, P. Proitsi, A. Hodges, J. Powell, L.O. Wahlund, I. Kloszewska, et al. Genome-wide association with MRI atrophy measures as a quantitative trait locus for Alzheimer’s disease. Molecular Psychiatry, 16:1130–8, 2011.

[196] Y.S. Aulchenko, S. Ripatti, I. Lindqvist, D. Boomsma, I.M. Heid, P.P. Pramstaller, B.W. Penninx, A.C. Janssens, J.F. Wilson, T. Spector, et al. Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nature Genetics, 41:47–55, 2009.

[197] O.M. Albagha, S.E. Wani, M.R. Visconti, N. Alonso, K. Goodman, M.L. Brandi, T. Cundy, P.Y. Chung, R. Dargie, J.P. Devogelaer, et al. Genome-wide association identifies three new susceptibility loci for Paget’s disease of bone. Nature Genetics, 43:685–9, 2011.

[198] O.M. Albagha, M.R. Visconti, N. Alonso, A.L. Langston, T. Cundy, R. Dargie, M.G. Dunlop, W.D. Fraser, M.J. Hooper, G. Isaia, et al. Genome-wide association study identifies variants at CSF1, OPTN and TNFRSF11A as genetic risk factors for Paget’s disease of bone. Nature Genetics, 42:520–4, 2010.

[199] L.R. Trevino, W. Yang, D. French, S.P. Hunger, W.L. Carroll, M. Devidas, C. Willman, G. Neale, J. Downing, S.C. Raimondi, et al. Germline genomic variants associated with childhood acute lymphoblastic leukemia. Nature Genetics, 41:1001–5, 2009.

[200] A.C. Naj, G. Jun, G.W. Beecham, L.S. Wang, B.N. Vardarajan, J. Buros, P.J. Gallins, J.D. Buxbaum, G.P. Jarvik, P.K. Crane, et al. Common variants at MS4A4/MS4A6E, CD2AP, CD33 and EPHA1 are associated with late-onset Alzheimer’s disease. Nature Genetics, 43:436–41, 2011.

[201] K. Estrada, M. Krawczak, S. Schreiber, K. van Duijn, L. Stolk, J.B. van Meurs, F. Liu, B.W. Penninx, J.H. Smit, N. Vogelzangs, et al. A genome-wide association study of northwestern Europeans involves the C-type natriuretic peptide signaling pathway in the etiology of human height variation. Human Molecular Genetics, 18:3516–24, 2009.

[202] H. Lango Allen, K. Estrada, G. Lettre, S.I. Berndt, M.N. Weedon, F. Rivadeneira, C.J. Willer, A.U. Jackson, S. Vedantam, S. Raychaudhuri, et al. Hundreds of variants clustered in genomic loci and biological pathways affect human height. Nature, 467:832–8, 2010.

[203] R. Anney, L. Klei, D. Pinto, R. Regan, J. Conroy, T.R. Magalhaes, C. Correia, B.S. Abrahams, N. Sykes, A.T. Pagnamenta, et al. A genome-wide scan for common alleles affecting risk for autism. Human Molecular Genetics, 19:4072–82, 2010.

[204] C. Newton-Cheh, T. Johnson, V. Gateva, M.D. Tobin, M. Bochud, L. Coin, S.S. Najjar, J.H. Zhao, S.C. Heath, S. Eyheramendy, et al. Genome-wide association study identifies eight loci associated with blood pressure. Nature Genetics, 41:666–76, 2009.

[205] H. Nan, M. Xu, J. Zhang, M. Zhang, P. Kraft, A.A. Qureshi, C. Chen, Q. Guo, F.B. Hu, E.B. Rimm, et al. Genome-wide association study identifies nidogen 1 (NID1) as a susceptibility locus to cutaneous nevi and melanoma risk. Human Molecular Genetics, 20:2673–9, 2011.

[206] T. Yamauchi, K. Hara, S. Maeda, K. Yasuda, A. Takahashi, M. Horikoshi, M. Nakamura, H. Fujita, N. Grarup, S. Cauchi, et al. A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nature Genetics, 42:864–8, 2010.

[207] N. Grarup, M. Overvad, T. Sparso, D.R. Witte, C. Pisinger, T. Jorgensen, T. Yamauchi, K. Hara, S. Maeda, T. Kadowaki, et al. The diabetogenic VPS13C/C2CD4A/C2CD4B rs7172432 variant impairs glucose-stimulated insulin response in 5,722 non-diabetic Danish individuals. Diabetologia, 54:789–94, 2011.

[208] X.O. Shu, J. Long, Q. Cai, L. Qi, Y.B. Xiang, Y.S. Cho, E.S. Tai, X. Li, X. Lin, W.H. Chow, et al. Identification of new genetic risk variants for type 2 diabetes. PLoS Genetics, 6, 2010.

[209] J. Dupuis, C. Langenberg, I. Prokopenko, R. Saxena, N. Soranzo, A.U. Jackson, E. Wheeler, N.L. Glazer, N. Bouatia-Naji, A.L. Gloyn, et al. New genetic loci implicated in fasting glucose homeostasis and their impact on type 2 diabetes risk. Nature Genetics, 42:105–16, 2010.

[210] R. Saxena, M.F. Hivert, C. Langenberg, T. Tanaka, J.S. Pankow, P. Vollenweider, V. Lyssenko, N. Bouatia-Naji, J. Dupuis, A.U. Jackson, et al. Genetic variation in GIPR influences the glucose and insulin responses to an oral glucose challenge. Nature Genetics, 42:142–8, 2010.

[211] M.C. Lawrence, H.S. Bhatt, and R.A. Easom. NFAT regulates insulin gene promoter activity in response to synergistic pathways induced by glucose and glucagon-like peptide-1. Diabetes, 51:691–8, 2002.

[212] P.C. Sabeti, D.E. Reich, J.M. Higgins, H.Z. Levine, D.J. Richter, S.F. Schaffner, S.B. Gabriel, J.V. Platko, N.J. Patterson, G.J. McDonald, et al. Detecting recent positive selection in the human genome from haplotype structure. Nature, 419:832–7, 2002.

[213] J.V. Neel. Diabetes mellitus: a “thrifty” genotype rendered detrimental by “progress”? American Journal of Human Genetics, 14:353–62, 1962.

[214] A. Helgadottir, G. Thorleifsson, A. Manolescu, S. Gretarsdottir, T. Blondal, A. Jonasdottir, A. Jonasdottir, A. Sigurdsson, A. Baker, A. Palsson, et al. A common variant on chromosome 9p21 affects the risk of myocardial infarction. Science, 316:1491–3, 2007.

[215] R. McPherson. Chromosome 9p21 and coronary artery disease. The New England Journal of Medicine, 362:1736–7, 2010.

[216] R. Saxena, B.F. Voight, V. Lyssenko, N.P. Burtt, P.I. de Bakker, H. Chen, J.J. Roix, S. Kathiresan, J.N. Hirschhorn, M.J. Daly, et al. Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, 316:1331–6, 2007.

[217] H.M. Broadbent, J.F. Peden, S. Lorkowski, A. Goel, H. Ongen, F. Green, R. Clarke, R. Collins, M.G. Franzosi, G. Tognoni, et al. Susceptibility to coronary artery disease and diabetes is encoded by distinct, tightly linked SNPs in the ANRIL locus on chromosome 9p. Human Molecular Genetics, 17:806–14, 2008.

[218] Y. Hiura, Y. Fukushima, M. Yuno, H. Sawamura, Y. Kokubo, T. Okamura, H. Tomoike, Y. Goto, H. Nonogi, R. Takahashi, et al. Validation of the association of genetic variants on chromosome 9p21 and 1q41 with myocardial infarction in a Japanese population. Circulation Journal, 72:1213–7, 2008.

[219] K. Hinohara, T. Nakajima, M. Takahashi, S. Hohda, T. Sasaoka, K. Nakahara, K. Chida, M. Sawabe, T. Arimura, A. Sato, et al. Replication of the association between a chromosome 9p21 polymorphism and coronary artery disease in Japanese and Korean populations. Journal of Human Genetics, 53:357–9, 2008.

[220] G.Q. Shen, S. Rao, N. Martinelli, L. Li, O. Olivieri, R. Corrocher, K.G. Abdullah, S.L. Hazen, J. Smith, J. Barnard, et al. Association between four SNPs on chromosome 9p21 and myocardial infarction is replicated in an Italian population. Journal of Human Genetics, 53:144–50, 2008.

[221] H. Ding, Y. Xu, X. Wang, Q. Wang, L. Zhang, Y. Tu, J. Yan, W. Wang, R. Hui, C.Y. Wang, et al. 9p21 is a shared susceptibility locus strongly for coronary artery disease and weakly for ischemic stroke in Chinese Han population. Circulation. Cardiovascular Genetics, 2:338–46, 2009.

[222] G. Badis, M.F. Berger, A.A. Philippakis, S. Talukder, A.R. Gehrke, S.A. Jaeger, E.T. Chan, G. Metzler, A. Vedenko, X. Chen, et al. Diversity and complexity in DNA recognition by transcription factors. Science, 324:1720–3, 2009.

[223] R. McPherson, A. Pertsemlidis, N. Kavaslar, A. Stewart, R. Roberts, D.R. Cox, D.A. Hinds, L.A. Pennacchio, A. Tybjaerg-Hansen, A.R. Folsom, et al. A common allele on chromosome 9 associated with coronary heart disease. Science, 316:1488–91, 2007.

[224] A. Visel, Y. Zhu, D. May, V. Afzal, E. Gong, C. Attanasio, M.J. Blow, J.C. Cohen, E.M. Rubin, and L.A. Pennacchio. Targeted deletion of the 9p21 non-coding coronary artery disease risk interval in mice. Nature, 464:409–12, 2010.

[225] M.S. Cunnington, M. Santibanez Koref, B.M. Mayosi, J. Burn, and B. Keavney. Chromosome 9p21 SNPs Associated with Multiple Disease Phenotypes Correlate with ANRIL Expression. PLoS Genetics, 6:e1000899, doi:10.1371/journal.pgen.1000899, 2010.

[226] X.Y. Fu, D.S. Kessler, S.A. Veals, D.E. Levy, and J.E. Darnell Jr. ISGF3, the transcriptional activator induced by interferon alpha, consists of multiple interacting polypeptide chains. Proceedings of the National Academy of Sciences of the United States of America, 87:8555–9, 1990.

[227] K. Silander, H. Tang, S. Myles, E. Jakkula, N.J. Timpson, L. Cavalli-Sforza, and L. Peltonen. Worldwide patterns of haplotype diversity at 9p21.3, a locus associated with type 2 diabetes and coronary heart disease. Genome Medicine, 1:51, 2009.

[228] T.L. Assimes, J.W. Knowles, A. Basu, C. Iribarren, A. Southwick, H. Tang, D. Absher, J. Li, J.M. Fair, G.D. Rubin, et al. Susceptibility locus for clinical and subclinical coronary artery disease at chromosome 9p21 in the multi-ethnic ADVANCE study. Human Molecular Genetics, 17:2320–8, 2008.

[229] B.G. Kral, R.A. Mathias, B. Suktitipat, I. Ruczinski, D. Vaidya, L.R. Yanek, A.A. Quyyumi, R.S. Patel, A.M. Zafari, V. Vaccarino, et al. A common variant in the CDKN2B gene on chromosome 9p21 protects against coronary artery disease in Americans of African ancestry. Journal of Human Genetics, 56:224–9, 2011.

[230] K. Yamagishi, A.R. Folsom, W.D. Rosamond, and E. Boerwinkle. A genetic variant on chromosome 9p21 and incident heart failure in the ARIC study. European Heart Journal, 30:1222–8, 2009.

[231] B. Pasaniuc, N. Zaitlen, G. Lettre, G.K. Chen, A. Tandon, W.H. Kao, I. Ruczinski, M. Fornage, D.S. Siscovick, X. Zhu, et al. Enhanced statistical tests for GWAS in admixed populations: assessment using African Americans from CARe and a Breast Cancer Consortium. PLoS Genetics, 7:e1001371, doi:10.1371/journal.pgen.1001371, 2011.

[232] The IBC 50K CAD Consortium. Large-scale gene-centric analysis identifies novel variants for coronary artery disease. PLoS Genetics, 7:e1002260, doi:10.1371/journal.pgen.1002260, 2011.

[233] J.C. Barrett, B. Fry, J. Maller, and M.J. Daly. Haploview: analysis and visualization of LD and haplotype maps. Bioinformatics, 21:263–5, 2004.

[234] G.E. Crooks, G. Hon, J.M. Chandonia, and S.E. Brenner. WebLogo: a sequence logo generator. Genome Research, 14:1188–90, 2004.