Addressing the Missing Heritability of Coronary Artery Disease

DISSERTATION

Presented in Partial Fulfillment of the Requirements for the Degree Doctor of Philosophy in the Graduate School of The Ohio State University

By

Katherine Louise Seal Hartmann

Biomedical Sciences Graduate Program

The Ohio State University

2016

Dissertation Committee:

Wolfgang Sadee, Advisor

Rebecca Jackson

Joseph Kitzmiller

Ray Hershberger

Copyrighted by

Katherine Louise Seal Hartmann

2016

Abstract

There is often a strong genetic component to human health and disease – reflected in the importance of an individual’s family history in clinical practice. This is true for coronary artery disease (CAD), with heritability estimates calculated by comparing concordance between monozygotic and dizygotic twins ranging from 40 to 60%. Sequencing of the human genome in the early 2000s led to the advent of Genome Wide Association Studies (GWAS) that test millions of variants across the genome for an association with disease. These large-scale studies have identified 58 loci associated with risk of CAD; yet each variant contributes relatively little to disease risk (odds ratios range from 1.01 to 1.92) and in total they account for only about 10% of the genetic risk. The remainder of the genetic risk remains unexplained, a phenomenon termed ‘missing heritability’. This work aims to address this ‘missing heritability’ by considering the potential for non-linear interactions contributing to CAD risk.

The following chapters are aimed at uncovering disease-relevant interactions. The same technological advancements that allowed for sequencing of the human genome have facilitated large-scale tissue-specific RNA sequencing efforts (GTEx), genome-wide scans for methylation and other signatures of regulatory elements (ENCODE), population-specific catalogs of genetic variants and linkage disequilibrium (1000 Genomes), and characterization of clinical phenotypes in dozens of GWA and other clinical studies, each built from thousands of samples. With all of these being publicly available, the question addressed in part in the following chapters is how to integrate such diverse datasets to inform our understanding of the role of genetic variants in disease.

Each chapter uses these resources to evaluate potential interactions, whether among variants, genes, or pathways. The first chapter serves as an introduction. The second is a survey of genes implicated in CAD by GWAS. Expression is used as a guide to prioritize the relevant gene locus for each GWAS variant, revealing loci with multiple protein-coding and non-coding transcripts and expression patterns that cannot be accounted for by a single functional haplotype. The third chapter considers differential expression of GWAS-based candidate genes in the context of disease. Gene expression, which incorporates both genetic and environmental factors, is used as a guide to search for interactions, revealing non-additive effects between gene expression patterns that are associated with risk of myocardial infarction. The next two chapters both focus on specific gene loci, allowing greater resolution to probe molecular mechanisms in detail. Chapter four considers the nicotinic receptor locus consisting of three genes, CHRNA5, CHRNA3, and CHRNB4, that each form a subunit of the nicotinic receptor. I identify three main signals (rs16969968, rs880395, and rs1948) associated with gene expression, which form diplotypes that better account for risk of nicotine dependence than individual variants, thus demonstrating the importance of interactions. The final chapter reveals a potential role of Cholesteryl Ester Transfer Protein (CETP), a drug target for CAD, in immune-related functions based on expression and co-expression patterns. Three candidate variants are considered, two with opposing effects on expression (rs247616 and rs6498863) and the third associated with splicing (rs5883). Each of these studies supports the role for non-linear, non-additive interactions contributing to coronary artery disease.


Dedicated to my Dad for believing that for me ‘not even the sky is the limit.’ Such sincere faith in my abilities has made me dream bigger and strive for more than I feel possible.


Acknowledgments

Many thanks to Wolfgang for being what can only be described as an inspiration for how joyous life and science can be. I would also like to thank Michał Seweryn for seeing in me something no one else did, not even myself, and for helping me to find a rhythm in science and in life that I expect will lead to a great many adventures.

To my brother Justin and his wife Emily, thank you for always being a source of love and for the inspiration you provide in how you both live your lives so fully and with such passion. To my Mom and her husband Don, thanks for your constant love, support and interest. A special thanks to Amanda & Andy Campbell and Steven, Allison, Evie, Eric, and Joey Scoville for the ‘three amigos’ family.

To all other professors and students who have talked with and helped me along the way, my sincere gratitude.


Vita

May 2006 ...... Brookwood High School

May 2010 ...... B.A. Biology, Cornell University

2010 to present ...... MSTP Trainee, The Ohio State University

Publications

J. P. Kitzmiller, P. F. Binkley, S. R. Pandey, A. M. Suhy, D. Baldassarre, and K. Hartmann, “Statin pharmacogenomics: pursuing biomarkers for predicting clinical outcomes,” Discov. Med., vol. 16, no. 86, pp. 45–51, Aug. 2013.

A. Suhy, K. Hartmann, L. Newman, A. Papp, T. Toneff, V. Hook, and W. Sadee, “Genetic variants affecting alternative splicing of human cholesteryl ester transfer protein,” Biochem. Biophys. Res. Commun., vol. 443, no. 4, pp. 1270–4, Jan. 2014.

W. Sadee, K. Hartmann, M. Seweryn, M. Pietrzak, S. K. Handelman, and G. A. Rempala, “Missing heritability of common diseases and treatments outside the protein-coding exome,” Hum. Genet., vol. 133, no. 10, pp. 1199–215, Oct. 2014.

R. Lu, R. M. Smith, M. Seweryn, D. Wang, K. Hartmann, A. Webb, W. Sadee, and G. A. Rempala, “Analyzing allele specific RNA expression using mixture models,” BMC Genomics, vol. 16, no. 1, p. 566, Jan. 2015.

S. K. Handelman, M. Seweryn, R. M. Smith, K. Hartmann, D. Wang, M. Pietrzak, A. D. Johnson, A. Kloczkowski, and W. Sadee, “Conditional entropy in variation-adjusted windows detects selection signatures associated with expression quantitative trait loci (eQTLs),” BMC Genomics, vol. 16 Suppl 8, p. S8, Jan. 2015.

A. Suhy, K. Hartmann, A. C. Papp, D. Wang, and W. Sadee, “Regulation of cholesteryl ester transfer protein expression by upstream polymorphisms: reduced expression associated with rs247616,” Pharmacogenet. Genomics, Jun. 2015.

F. Makadia, P. P. Mehta, C. E. Wisely, J. E. Santiago-Torres, K. Hartmann, M. J. Welker, and D. Habash, “Creating and Completing Service-Learning within Medical School Curricula: From the Learner’s Perspective,” International Journal of Medical Students, vol. 3, no. 1. 31-Aug-2015.

P. P. Mehta, J. E. Santiago-Torres, C. E. Wisely, K. Hartmann, F. A. Makadia, M. J. Welker, and D. L. Habash, “Primary Care Continuity Improves Diabetic Health Outcomes: From Free Clinics to Federally Qualified Health Centers,” J. Am. Board Fam. Med., vol. 29, no. 3, pp. 318–24, May-June 2016.

Fields of Study

Major Field: Biomedical Sciences Graduate Program IBGP


Table of Contents

Abstract

Acknowledgments

Vita

Publications

Fields of Study

Table of Contents

List of Tables

List of Figures

Chapter 1: Introduction to the Genetics of Common Disease

Chapter 2: A Survey of Genes Implicated in Coronary Artery Disease

Chapter 3: Non-linear Interactions Between Candidate Genes of Myocardial Infarction Revealed in mRNA Expression Profiles

Chapter 4: The CHRNA5/CHRNA3/CHRNB4 nicotinic receptor regulome: genomic architecture, regulatory variants, and clinical associations

Chapter 5: Genetic variation in the cholesteryl ester transfer protein locus

Chapter 6: Concluding Remarks

References

Appendix A: Ethics Approval

Appendix B: Funding Sources

Appendix C: Availability of data and materials

Appendix D: Abbreviations

Appendix E: Supplemental Figures

Appendix F: Supplemental Tables

Appendix G: Introduction of network construction algorithm

Appendix H: GTEx Tissues Color Key


List of Tables

Table 1. Number of distinct signals associated with expression for genes implicated by GWAS and reprioritized by eQTL analysis.

Table 2. Additional haplotype for LIPA expression in whole blood improves model for gene expression but not MI.

Table 3. Comparison of additive models with and without non-linear interactions between candidate gene pairs in CATHGEN.

Table 4. Interactions between candidate gene pairs in the Framingham Heart Study.

Table 5. Tissue specific expression profile of genes in nicotinic receptor locus.

Table 6. Analysis of variance (ANOVA) comparing ability of models with different SNP combinations to account for nicotine dependence.

Table 7. Linear models show protective effect for rs880395 and risk for both rs1948 and rs16969968 for nicotine dependence (ND).

Table 8. Association between candidate SNPs and CETP expression in GTEx and CATHGEN.

Table 9. Association between candidate SNPs and MI in CATHGEN.

Table 10. Annotation and gene information for GWAS variants.

Table 11. GO Cluster Details.

Table 12. GWAS-based candidate genes.

Table 13. Epistatic pairs.

Table 14. CHRNA5-enhancer locus.

Table 15. Evidence sources for candidate SNPs in nicotinic receptor locus.

Table 16. SNPs marking functional haplotypes and minor allele frequencies (MAFs) from different 1000 Genomes populations.


List of Figures

Figure 1. Simulation of relationship between linkage disequilibrium to functional SNP (R²) and association with disease state.

Figure 2. Example locus (LIPA) – GWAS assigned genes have multiple protein coding and non-coding transcripts on both the sense and antisense strands.

Figure 3. Summary of CAD GWAS gene loci.

Figure 4. Clustering of CAD GWAS genes by shared GO terms.

Figure 5. Number of distinct signals associated with eQTLs.

Figure 6. Three distinct signals associated with LIPA eQTLs in whole blood.

Figure 7. Linkage disequilibrium between LIPA eQTLs in all tissues reveals multiple distinct haplotypes.

Figure 8. Local Manhattan plot for LIPA in whole blood suggests presence of transcript specific eQTLs.

Figure 9. Differentially expressed candidate genes in MI.

Figure 10. Differentially expressed candidate gene pairs in MI.

Figure 11. Odds ratios and conditional odds ratios of MI for non-linearly interacting pairs.

Figure 12. Tissue-specific co-expression networks.

Figure 13. Introduction of the network construction algorithm.

Figure 14. Co-expression as a filter for disease-relevant interactions.

Figure 15. Shortest paths between candidate genes in large-scale co-expression network.

Figure 16. Differentially expressed candidate genes in hypercholesterolemia.

Figure 17. Differentially expressed candidate gene pairs in hypercholesterolemia.

Figure 18. Replication of differentially expressed genes.

Figure 19. CHRNA5/CHRNA3/CHRNB4 nicotinic receptor locus and analysis approach.

Figure 20. Tissue-specific mRNA co-expression.

Figure 21. Overlap between eQTLs for CHRNA3, CHRNA5 and RP11-650L12.2 in nucleus accumbens (A) and skeletal muscle (B) as reported by GTEx.

Figure 22. Local Manhattan plot of eQTLs in nicotinic receptor locus.

Figure 23. LD accounts for CHRNA3 eQTL strength.

Figure 24. Interaction between nonsynonymous CHRNA5 SNP rs16969968 and CHRNA5-enhancer variant rs880395 in determining gene expression.

Figure 25. LD between candidate SNPs with expression, regulatory, and clinical associations annotated.

Figure 26. Association of nicotine dependence with rs16969968 genotype as well as haplotypes and diplotypes comprised of rs16969968, rs880395, and rs1948, reported as odds ratio with 95% confidence intervals.

Figure 27. CETP best expressed in Spleen.

Figure 28. CETP expression profile clusters with immune-related family members LBP and BPI.

Figure 29. CETP is co-expressed with immune-related genes.

Figure 30. Candidate SNPs in CETP gene locus.

Figure 31. Tissue specific expression of APOA1, APOC1, ALDH2, CYP17A1, and MFGE8.

Figure 32. Number of distinct signals associated with eQTLs.

Figure 33. Linkage disequilibrium between LIPA eQTLs in whole blood reveals multiple distinct haplotypes.

Figure 34. Distribution of p-values.

Figure 35. LD of nicotinic receptor locus in YRI and CEU.

Figure 36. LD association with eQTL strength for CHRNA5, RP-11, and CHRNB4.

Figure 37. CHRNA3 eQTLs replicate in Braineac.

Figure 38. Allelic expression ratios for nicotinic receptor locus.

Figure 39. LD between candidate SNPs in Africans with expression, regulatory, and clinical associations annotated.

Figure 40. Association between GWAS p-value and linkage disequilibrium (R²) with top scoring SNP indicates multiple functional haplotypes in CETP locus.


Chapter 1: Introduction to the Genetics of Common Disease

Genetics accounts for the inheritance of traits from one generation to the next, transmitted through DNA. Like many others, I have opted to focus on the sequence of DNA nucleotides as the heritable unit, although it is becoming increasingly clear that other factors like methylation also play a role (Kaminsky et al., 2009). The importance of genetics in biology cannot be overstated as this forms the basis on which natural selection acts to shape the evolution of traits over time.

In biomedical science, there is often a strong genetic component to human health and disease – reflected in the importance of an individual’s family history in clinical practice. The insight of genetics into disease is powerful for two reasons.

First, it is fixed: an individual’s DNA sequence does not change from cell to cell or throughout their lifetime.¹ This means that it can be measured once and in a tissue that is convenient to sample (e.g., blood) and the information remains relevant throughout an individual’s life. This also means that it measures lifetime risk. That is, if a variant is associated with high blood pressure it reflects prolonged exposure to high blood pressure.² This is all in contrast to other biomarkers like lipid levels, blood pressure, or blood glucose levels that require annual, or more frequent, measurements to inform the clinical state of a patient and give insight only into a single moment in time. For this reason genetic variants are well suited as biomarkers.

¹ This assumes that we are considering ‘classical’ forms of inheritance – alterations in DNA sequence such as Single Nucleotide Polymorphisms (SNPs), Copy Number Variants (CNVs), or insertions and deletions (indels). Genetic factors such as methylation have shown flexibility across an individual’s lifespan. Cancer would also be an exception – SNPs, CNVs, and indels can rapidly develop in a population of cells. Although the sequence of DNA does not change (with the exceptions noted), the effect of any given variant is highly context dependent, often varying between tissues.

The second reason that genetics is a powerful tool in human health and disease is that it is, in a way, causative (or at the very least contributory to development of disease). If disease occurs when a particular gene is mutated, that gene is likely important for disease pathology and a good target for drug design. Here, PCSK9 serves as an example. PCSK9 was implicated in hyperlipidemia and cardiovascular disease by gain of function mutations tied to familial hypercholesterolemia and loss of function mutations associated with reduced LDL and reduced risk of cardiovascular disease (Elguindy & Yacoub, 2013). Mechanistic studies revealed that PCSK9 triggers degradation of the LDL receptor, such that increased activity reduces LDL uptake by the liver, resulting in hyperlipidemia (Seidah, Awan, Chrétien, & Mbikay, 2014). PCSK9 inhibitors are now among the newest drugs on the market for treatment of hyperlipidemia in prevention of cardiovascular disease.

With genetics having the potential to provide significant insight into who is at risk for disease and how to develop novel therapeutics, the question becomes which genetic variants matter? The human genome is composed of 3 billion nucleotides with each individual harboring 4-5 million variants, the vast majority of which are single nucleotide polymorphisms (SNPs) or short insertion/deletions that are common in the population (minor allele frequency > 5%) (Auton et al., 2015). To date nearly 100 million variants have been catalogued in the public database (dbSNP).

² An exception would be if the genetic variant acts during a particular stage of life (e.g., during development) or only in the context of a specific environmental factor (e.g., smoking).

Common diseases such as coronary artery disease (CAD) have a strong and very apparent genetic component, emphasized by how these disorders are known to ‘run in families’. Twin studies that compare the concordance of disease between monozygotic and dizygotic twins are a means of quantifying this heritability. For coronary artery disease, these studies have estimated heritability at between 40 and 60% (Zdravkovic et al., 2002).
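As a rough illustration of how twin data translate into a heritability estimate, one classical approximation is Falconer's formula, h² = 2(r_MZ − r_DZ), where r_MZ and r_DZ are the trait correlations within monozygotic and dizygotic twin pairs. The sketch below assumes this formula and uses made-up correlations; it is not necessarily the method used in the cited twin study.

```python
def falconer_heritability(r_mz: float, r_dz: float) -> float:
    """Approximate heritability from twin correlations (Falconer's formula)."""
    return 2.0 * (r_mz - r_dz)


# Hypothetical twin correlations, chosen only for illustration.
h2 = falconer_heritability(r_mz=0.60, r_dz=0.35)
print(f"Estimated heritability: {h2:.2f}")
```

The formula rests on the assumption that monozygotic and dizygotic twins share their environments to the same degree, so any excess similarity in monozygotic pairs is attributed to genetics.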

Pedigree analysis and candidate gene studies have been used to identify a number of genetic variants important in CAD. In addition, sequencing of the human genome in the early 2000s led to the advent of Genome Wide Association Studies (GWAS), in which testing the association of genetic variants with disease on a large, genome-wide scale (~5 million SNPs) became technologically feasible. Although not as agnostic as it may at first appear due to bias in selecting variants for analysis, the broad scope of GWAS makes it well-suited to identifying novel genes and pathways involved in disease. For CAD, vascular smooth muscle contractile function, cytoskeleton rearrangement, cell adhesion, matrix interactions, serotonin and calcium signaling, inflammation, and lipid metabolism are all pathways that have been implicated (Deloukas et al., 2013; Torkamani, Topol, & Schork, 2008). CAD GWAS variants associated with lipid levels and hypertension in addition to CAD highlight the role of both processes in disease pathogenesis, while GWAS findings unrelated to any currently known risk factor suggest there are a number of pathways contributing to disease that remain unknown (Roberts, 2014).

Although candidate gene, family linkage, and GWA studies have identified a number of variants and genes implicated in disease, family history still does a significantly better job of accounting for disease risk: 26.3% of heritability is accounted for by complete family history, while only 6.9% is accounted for by SNPs (Do, Hinds, Francke, & Eriksson, 2012). With known variants accounting for only 6-15% of risk, the remainder is ‘missing’ (Deloukas et al., 2013; Do et al., 2012; H. Tada et al., 2014). This phenomenon, shared among virtually all polygenic, common, complex diseases, is termed ‘missing heritability.’ Because accounting for the genetic risk of disease offers significant potential for developing novel therapeutics and personalized approaches to healthcare, the ‘case of the missing heritability’ has been widely discussed and a number of theories proposed.

Rare variants

GWAS, like any other association test, is susceptible to biases, with SNPs that are more frequent and/or have larger, more stable effect sizes being more easily detected. Rare variants of moderate effect can easily fall under the radar, especially given the sample size of many studies. In the most recent CAD GWAS, rare variants were imputed using 1000 Genomes and found to account for only 2% of heritability, indicating genetic risk of CAD is unlikely to be explained by as yet unidentified rare variants (Nikpay et al., 2015). Larger sample sizes will most certainly uncover additional SNPs; however, it is my opinion that they will add only incrementally to our ability to account for the genetic risk of disease. Larger advances may come from considering alternate models that test multiple variants in an integrated fashion instead of evaluating each variant one at a time.

Structural variants

GWAS as well as candidate gene and family linkage studies have predominantly focused on single nucleotide polymorphisms. However, structural variants such as insertions and deletions and copy number variants may also contribute to disease risk (Sudmant et al., 2015). Further structural considerations like chromatin architecture and methylation are also likely to be important contributors to heritability. Our understanding of these structural changes and our ability to address them from a technical standpoint are still in their infancy.

Marker SNPs

Genotyping arrays used in GWA studies cover anywhere from a few dozen to 5 million SNPs, far short of the 100 million SNPs identified thus far (Auton et al., 2015). SNPs included as markers on a genotyping array are chosen based on their ability to carry information about the surrounding region through linkage disequilibrium. The idea is that in measuring, say, 1 million SNPs you acquire information about the entire genome. This, however, means that in many cases GWAS will not detect the functional variant, but rather a marker SNP. If a functional variant is not adequately represented by any given marker SNP, it will fail to be detected by GWAS. Furthermore, the effect sizes estimated from these markers will be distorted, deviating more as the marker becomes a less valid proxy for the functional variant. Heritability estimates will be accordingly skewed based on these estimations, although based on simulations it seems deviations for well-linked marker SNPs (R² > 0.8) will be mild (Figure 1).

Figure 1. Simulation of relationship between linkage disequilibrium to functional SNP (R²) and association with disease state. (A) Functional SNP was set to have an odds ratio (OR) of 2 (red line). Each dot represents the estimated OR when a marker SNP is considered. As linkage disequilibrium declines, the estimated OR both declines and shows increased variability. In B and C, the functional interaction (SNP1*SNP2) was set to have an effect size of 1.2 (red line). Each dot represents the estimated effect size with marker SNP substituted for SNP1 (B) or for SNP2 (C). Deviation from the true value and variability in point estimate is greater for the interaction term (B and C) than for the individual SNP (A).
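The attenuation shown in panel A can be approximated analytically. The following is a minimal sketch, not the simulation used to produce Figure 1: it assumes a multiplicative allelic model, controls that reflect population allele frequencies, and positive linkage disequilibrium between the risk allele and the marker allele. The function name and all parameter values are hypothetical.

```python
import math


def expected_marker_or(functional_or: float, p: float, q: float, r2: float) -> float:
    """Expected allelic odds ratio observed at a marker SNP.

    p  : population frequency of the functional (risk) allele
    q  : population frequency of the linked marker allele
    r2 : squared correlation between the two loci (must be attainable
         for the chosen p and q; always true when p == q)
    """
    # LD coefficient D implied by r2 and the two allele frequencies.
    d = math.sqrt(r2 * p * (1 - p) * q * (1 - q))
    # Probability that a chromosome carries the marker allele,
    # given that it carries the risk allele or the other allele.
    m_given_risk = (p * q + d) / p
    m_given_other = ((1 - p) * q - d) / (1 - p)
    # Risk-allele frequency among case chromosomes under the allelic model.
    p_case = p * functional_or / (p * functional_or + (1 - p))
    # Marker-allele frequency among case chromosomes.
    q_case = p_case * m_given_risk + (1 - p_case) * m_given_other
    # Odds ratio comparing marker-allele odds in cases versus controls.
    return (q_case / (1 - q_case)) / (q / (1 - q))


for r2 in (1.0, 0.8, 0.5, 0.2, 0.0):
    print(r2, round(expected_marker_or(2.0, p=0.3, q=0.3, r2=r2), 3))
```

This calculation captures only the average decline in the estimated odds ratio as linkage disequilibrium falls; the increased variability described in the caption arises from sampling and would require a simulation with finite case and control groups.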

Epistasis

Individual SNPs rarely occur in isolation. Instead, they are inherited in blocks or haplotypes, so why would just one SNP affect disease? In model organisms, there are clear examples of epistasis or SNP-SNP interactions both within and between genes (C. Li et al., 2016).

Large-scale efforts to detect epistasis have largely failed, presumably due to the computational challenges of testing every possible pairwise combination between millions of SNPs. Lippert et al. attempted a genome-wide scan for epistasis using the Wellcome Trust data, comprising seven common disease phenotypes. By pooling disease cases from other phenotypes as additional controls, they were able to increase their statistical power, but still did not find compelling evidence for epistatic interactions (Lippert et al., 2013). Even more so than single-SNP analysis, identifying epistatic pairs is hindered by consideration of marker SNPs rather than functional variants (Figure 1).
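To give a sense of the scale involved, the sketch below counts the pairwise tests for a hypothetical array of one million SNPs and the per-test significance threshold that a simple Bonferroni correction would imply. Both the array size and the choice of Bonferroni correction are assumptions made for illustration; the cited studies may use other corrections.

```python
import math


def pairwise_tests(n_snps: int) -> int:
    """Number of distinct SNP pairs among n_snps variants."""
    return math.comb(n_snps, 2)


def bonferroni_threshold(n_tests: int, alpha: float = 0.05) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


n_pairs = pairwise_tests(1_000_000)  # hypothetical array size
print(f"{n_pairs:,} pairwise tests")
print(f"per-test threshold: {bonferroni_threshold(n_pairs):.2e}")
```

Because the number of pairs grows with the square of the number of SNPs, both the computation and the multiple-testing burden increase far faster than in a single-SNP scan.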

One strategy in identifying epistatic pairs is to reduce the number of hypotheses by prioritizing genes and SNPs that are more likely to interact with one another, based either on statistical parameters (e.g. the SNPs are individually significant) or on prior biological knowledge. A number of bioinformatics approaches have been employed to integrate available sources of information; for example, Biofilter, which has been successfully used to identify a handful of epistatic interactions (Sun et al., 2014), incorporates data from Reactome, KEGG, Gene Ontology, DIP, PFAM, and Ensembl to generate multi-SNP models based on previous knowledge about gene-gene and gene-disease relationships (Bush, Dudek, & Ritchie, 2009). Such approaches have the benefit of generating models that are easily interpretable, but are limited by publication biases and restricted to known relationships (Ritchie, 2011).

Complex phenotypes

As with many complex diseases, not all cases of CAD are the same. There are varying degrees of severity and different clinical presentations. Some of these distinctions almost certainly come with different genetic risk factors. Accordingly, one complication to genetic studies, and GWAS in particular, is mistaken pooling of diverse phenotypes. For CAD in particular, this does not seem to be a substantial limitation, as similar results were obtained despite variability in the criteria used to assign the phenotype (Nikpay et al., 2015).

Each of these theories likely contributes to the missing heritability, although available evidence suggests that for CAD in particular rare variants and a mixture of phenotypes contribute relatively little. The work presented here aims to address the role of non-linear interactions in disease risk, with RNA expression being used as a guide.

Although still rather removed from a clinical phenotype, gene expression is a more ‘proximal’ phenotype than a SNP and represents integration of genetic and environmental factors. It is highly regulated – canalized so that expression is neither too high nor too low. This is accomplished through multiple layers of regulation, all of which may have one or more genetic factors. Non-coding RNAs (ncRNAs) are increasingly being recognized for their role in this regulation. Expression is highly tissue specific, which can present a significant challenge in genetic expression studies, as the relevant variant may go unidentified simply because the relevant tissue is not considered. Databases like the Genotype-Tissue Expression Project (GTEx), which provide tissue-specific data, help to overcome this challenge.

Much attention has been focused on nonsynonymous coding variants, those that change the sequence of a protein, in part because they are easiest to detect and interpret. However, regulatory variants, those that alter the quantity of RNA, may be of equal or greater importance. With over 90% of GWAS variants falling into regulatory regions, they are strongly implicated in disease (Hindorff et al., 2009). Although they do not directly alter protein sequence, regulatory variants have significant potential to alter biological processes and affect cellular functions by altering quantity. The effects of regulatory variants may be subtle and thus not under strong negative selection, as would be expected for deleterious protein-coding variants. Accordingly, regulatory variants may be more likely to act in combination and to exhibit a robust level of variation that can be acted on by evolution.

Gene expression represents a means by which to isolate and understand the effects of individual genetic variants. Functional variants, almost regardless of their mechanism, alter expression, although changes may often be small and remain below the threshold for detection. Variants that alter transcription factor binding or chromatin looping (altering the proximity of long-range enhancers and accessibility of the DNA), those that alter the rate of transcription (changing the efficiency of splicing, for example), or those that change the folding and hence stability of RNA would all trigger changes in RNA levels. Gene expression, or rather co-expression, can also inform a search for gene-gene interactions. Co-expression offers some insight into which genes function together, as it is expected that genes that regulate one another, are regulated by the same factor, physically interact with one another, or participate in a shared pathway would be co-expressed with one another.
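One simple way to quantify co-expression is the correlation of two genes' expression levels across samples. The sketch below assumes Pearson correlation purely for illustration; the gene names and expression values are invented, and the measure used in later chapters may differ.

```python
import math


def pearson(x: list[float], y: list[float]) -> float:
    """Pearson correlation between two equal-length expression vectors."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


# Hypothetical expression values (arbitrary units) across five samples.
expression = {
    "GENE_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "GENE_B": [2.1, 3.9, 6.2, 8.0, 9.8],
    "GENE_C": [5.0, 1.0, 4.0, 2.0, 3.0],
}

print(round(pearson(expression["GENE_A"], expression["GENE_B"]), 3))
print(round(pearson(expression["GENE_A"], expression["GENE_C"]), 3))
```

In this toy data GENE_A and GENE_B rise together across samples and would be called co-expressed, whereas GENE_A and GENE_C show only a weak relationship.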

The following chapters are aimed at uncovering disease-relevant interactions. The same technological advancements that allowed for sequencing of the human genome have facilitated large-scale tissue-specific RNA sequencing efforts (GTEx), genome-wide scans for methylation and other signatures of regulatory elements (ENCODE), population-specific catalogs of genetic variants and linkage disequilibrium (1000 Genomes), and characterization of clinical phenotypes in dozens of GWA and other clinical studies, each built from thousands of samples. With all of these being publicly available, the question addressed in part in the following chapters is how to integrate such diverse datasets to inform our understanding of the role of genetic variants in disease.

Each chapter uses these resources to evaluate potential interactions, whether among variants, genes, or pathways. The second chapter is a survey of genes implicated in coronary artery disease by GWAS. Expression is used as a guide to prioritize the relevant gene locus for each GWAS variant, revealing loci with multiple protein-coding and non-coding transcripts and expression patterns that cannot be accounted for by a single functional variant. The third chapter considers differential expression of GWAS-based candidate genes in the context of disease. It identifies non-additive effects between gene expression patterns that are associated with risk of MI. The fourth and fifth chapters both evaluate specific gene loci. The fourth considers the nicotinic receptor locus, which consists of three genes, CHRNA5, CHRNA3, and CHRNB4, that each form subunits of the nicotinic receptor. We identify three main signals (rs16969968, rs880395, and rs1948) associated with gene expression, which form diplotypes that better account for risk of nicotine dependence than individual variants, demonstrating the importance of interactions. The fifth chapter reveals a potential role for Cholesteryl Ester Transfer Protein (CETP), a drug target for CAD, in immune-related functions based on expression and co-expression patterns. It considers three candidate variants, two having opposing effects on expression (rs247616 and rs6498863) and one associated with splicing (rs5883). Each of these studies supports the role for non-linear, non-additive interactions contributing to disease.


Chapter 2: A Survey of Genes Implicated in Coronary Artery Disease

Background

Genome-wide association studies (GWAS) have identified dozens of genetic variants (SNPs) associated with cardiovascular disease risk and related clinical phenotypes (e.g. blood pressure, lipid levels, etc.). However, for the most part, the functional variant and even the relevant gene remain unknown, with the nearest gene often being reported. Recent advances in large-scale RNA expression analysis, for example the Genotype-Tissue Expression Project (GTEx) with thousands of tissue-specific RNA sequencing measures, provide a wealth of information about the tissue-specific relationship between genetic variants and expression of surrounding genes. Integration of these expression quantitative trait loci (eQTLs) with GWAS results can guide our interpretation of the latter. For example, the evidence that a particular GWAS hit is relevant for a gene becomes quite compelling if it is associated with expression of that gene. In a recent paper, Brænne et al. used eQTLs, among other criteria, to re-assign GWAS hits, although they did not make use of the GTEx database, a particularly valuable resource given evidence that CAD is attributable to tissue-specific regulatory effects (Brænne et al., 2015; Won et al., 2015).

Interpretation of GWAS results is further complicated because GWAS reports statistical associations only for SNPs covered by a given genotyping platform, so it is unlikely to detect the functional variants themselves. Instead, GWAS is optimized to detect markers that are genotyped in the study. Thus the particular variant(s) detected by GWAS depend on a combination of effect size (influenced both by how well the SNP serves as a proxy for the functional variant and by how large an effect this functional variant has on the clinical outcome), minor allele frequency, sample size, population structure, and the genotyping platform. Some of this is mitigated by more recent GWAS that use 1000 genomes to impute indels, rare variants, and common variants not included on the genotyping array (Nikpay et al., 2015). However, it is still more likely that GWAS will detect markers rather than functional variants, and in very few instances do we understand how a given GWAS variant increases risk of disease.

Although GWAS have implicated several dozen variants in the risk of CAD, additive models incorporating the individual effects of these variants account for only about 10% of genetic risk of disease (Hayato Tada et al., 2014). Considering marker SNPs rather than functional variants distorts the estimated effect size for each variant and may account for a portion of the heritability that remains unaccounted for by models that sum the individual effects of these marker SNPs. Failed searches for epistasis may also be explained by the use of marker rather than functional variants, as the power to detect epistatic interactions declines rapidly with decreasing linkage disequilibrium to the functional variant.
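The attenuation described above can be made concrete with a short derivation. This is a sketch under simplifying assumptions that are not stated in the text: genotypes are standardized to mean 0 and variance 1, effects are additive, and the two interacting loci are unlinked. The symbols C (functional variant), M (marker), r (their correlation), beta and gamma are introduced here for illustration only.

```latex
% Single locus: trait y depends on the functional variant C, but only marker M is tested.
\begin{align}
y &= \beta_C\, C + \varepsilon, \qquad r = \operatorname{Corr}(M, C), \\
\beta_M &= \operatorname{Cov}(y, M) = r\,\beta_C, \\
\text{variance explained by } M &= r^2 \beta_C^2 .
\end{align}
% Interaction between two unlinked loci, tagged by markers with correlations r_1 and r_2:
\begin{align}
y &= \gamma\, C_1 C_2 + \varepsilon, \\
\operatorname{Cov}(y, M_1 M_2) &= \gamma\, \mathbb{E}[M_1 C_1]\, \mathbb{E}[M_2 C_2] = r_1 r_2\, \gamma, \\
\text{variance explained by } M_1 M_2 &= r_1^2\, r_2^2\, \gamma^2 .
\end{align}
```

Under these assumptions a marker with R2 = 0.8 to the functional variant captures 80% of the additive signal, whereas a pairwise interaction tagged at R2 = 0.8 at both loci retains only 64% of its signal, and at R2 = 0.5 only 25%.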

To interpret and refine GWAS findings for coronary artery disease, we begin by evaluating the complexity of each gene locus in terms of the number of annotated isoforms, non-coding RNAs in the region, and variability in expression from tissue to tissue. We then cluster these genes using GO annotations, a framework which may help in understanding how these genes together contribute to coronary artery disease.

Supplementing work begun by Braenne et al. in reassigning candidate genes, we use eQTLs reported by GTEx to determine genes likely to be relevant. For each of these ‘validated’ genes, we then consider the number of signals needed to account for the observed eQTLs. In many instances we find evidence for multiple functional haplotypes contributing to variability in gene expression. We also consider the potential for transcript-specific eQTLs and for eQTLs associated with the variance rather than the absolute level of gene expression. These efforts aim to show how gene expression and GWAS data may be integrated to guide our interpretation of the latter.

Results

GWAS-based candidate genes

The most recent CAD GWAS used 1000 genomes to impute insertions/deletions, rare variants and common variants that were not directly genotyped. While confirming 47 of the 48 previously identified loci, they found an additional 10, bringing the total count of CAD associated loci to 58 (Nikpay et al., 2015). Each of these variants has then been assigned one or more genes based largely on proximity.

The genomic loci for genes assigned to these variants are rarely simple. They often harbor multiple protein-coding and non-coding transcripts arranged on both the sense and antisense strands (Figure 2). Among the 107 genes that are either annotated by GWAS or appear as eGenes for GWAS variants, there is an average of 7 transcripts per gene and a maximum of 32 (BCAS3 – rs7212798), with one-third of all transcripts being non-coding (Figure 3). More than half of the genes (59) have one or more genes antisense to them (i.e. located on the opposite strand and overlapping). The complexity of these loci emphasizes the need for a combination of targeted molecular studies and clever computational approaches to determine the relevant transcript(s). With a median of 68 publications and a maximum of 22,658 (ABO), a handful of these genes are significantly better studied than the rest (Figure 3). Given that nearly 20% of genes do not have a single paper indexed in PubMed, it is clear that many targeted studies have yet to be completed.

Figure 2. Example locus (LIPA) – GWAS assigned genes have multiple protein coding and non-coding transcripts on both the sense and antisense strands. With 8 isoforms and 6 transcripts antisense to it, LIPA serves as an example of the complexity often found in these loci.

The vast majority of GWAS variants fall in non-coding regions where they are primed to modify regulatory elements. Regulatory elements, especially enhancers, tend to be highly tissue specific. To distinguish those genes that are universally expressed across different tissue types from those with tissue-specific expression patterns, we considered the variance of RPKM expression values across tissues (Figure 3). APOA1, CYP7A1, APOC1, MFGE8, and ALDH2 demonstrate highly tissue-specific expression, with APOA1, APOC1, and ALDH2 being predominantly expressed in liver, CYP7A1 predominantly expressed in adrenal gland, and MFGE8 in artery (Figure 31). As the liver is critical to lipid metabolism and dyslipidemia is a risk factor for CAD, it is not surprising to find genes with liver-specific expression patterns. Perhaps what is surprising is that many of the genes show relatively little variance across tissues, suggesting they have more universal patterns of expression. Thus an important step in interpreting GWAS variants and their associated genes will be determining the relevant tissue or tissues.
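The variance-across-tissues summary described above can be sketched as follows. This is a minimal Python illustration, not the analysis code used in this work; the gene names and RPKM values are hypothetical placeholders rather than GTEx data, and the choice of the sample variance is an assumption.

```python
from statistics import variance

def expression_variance_by_gene(rpkm):
    """Return {gene: sample variance of RPKM across tissues}."""
    return {gene: variance(list(by_tissue.values()))
            for gene, by_tissue in rpkm.items()}

# Hypothetical RPKM values for illustration only (not GTEx measurements).
toy_rpkm = {
    "GENE_LIVER_SPECIFIC": {"liver": 900.0, "artery": 2.0, "whole_blood": 1.0},
    "GENE_UBIQUITOUS": {"liver": 20.0, "artery": 22.0, "whole_blood": 18.0},
}

variances = expression_variance_by_gene(toy_rpkm)
# Genes ranked from most to least tissue-specific by this measure.
ranked = sorted(variances, key=variances.get, reverse=True)
```

A raw variance is sensitive to the overall expression level of a gene, so a scaled measure such as the coefficient of variation may be preferable when comparing genes with very different mean expression.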


Figure 3. Summary of CAD GWAS gene loci. For each of the 107 genes that have either been annotated by GWAS or appear as eGenes for GWAS variants, the number of transcripts (isoforms) assigned to the gene (A), the number of isoforms antisense to the gene (B), the number of publications indexed by PubMed as of June 15, 2016 (C), and the variance in expression (measured as RPKM) across 54 tissues reported by GTEx (D).

To evaluate how candidate genes relate to one another and how they may interact to contribute to coronary artery disease, we identified GO terms associated with each gene and clustered genes based on the number of shared GO terms. The resulting heatmap produced twelve clusters (Figure 4, Table 11). The first, second, and tenth clusters include many of the genes with well-documented roles in lipid metabolism (Cluster #1: APOA1, APOA5, APOB, LDLR, LPL; Cluster #2: APOC1, LPA, PCSK9, PLG; Cluster #10: ABCG5, ABCG8, ATP2B1, SLC22A3/A4/A5). The division of this broader category into three separate clusters suggests distinct mechanisms; all genes in Cluster #10, for example, are integral membrane transporters. Cluster #11 also includes genes with a shared function: SMAD3, HDAC9, and REST are all transcription factors.

The grouping of genes into units with similar functions suggests that this may be a means of identifying shared mechanisms among candidate genes. Further work identifying whether genes within a cluster interact physically, are components of a shared pathway, or perhaps substitute for one another could provide substantial insight into how these genes in combination contribute to CAD risk.
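The similarity measure underlying this clustering, the number of GO terms two genes share, can be sketched as below. This is an illustrative Python example only; the gene names and term labels are placeholders, and the hierarchical clustering and heatmap steps are not shown.

```python
from itertools import combinations

def shared_go_counts(go_terms):
    """Return {(gene_a, gene_b): number of GO terms the two genes share}."""
    return {
        (a, b): len(go_terms[a] & go_terms[b])
        for a, b in combinations(sorted(go_terms), 2)
    }

# Placeholder annotations for illustration only (not real GO assignments).
toy_go = {
    "GENE_A": {"term_1", "term_2", "term_3"},
    "GENE_B": {"term_2", "term_3"},
    "GENE_C": set(),  # a gene with no GO terms shares nothing with any other gene
}

counts = shared_go_counts(toy_go)
```

Genes without any annotation, like GENE_C here, necessarily score zero against every other gene, which is consistent with the block of unannotated genes falling into a single cluster.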


Figure 4. Clustering of CAD GWAS genes by shared GO terms. Genes assigned to GWAS variants are listed on both the x and y axis with the color of each box denoting the number of GO terms in common between the two genes. Red indicates a greater number of shared terms (max = 21), orange indicates fewer shared terms (min = 0). Clusters are labeled by numbered circles plotted on top of the dendrogram. Note for 19 genes (block of solid orange in the middle of the plot falling into cluster #8) there are no GO terms assigned.

Re-prioritizing relevant genes using eQTLs

To reassign variants identified by GWAS to those genes most likely to be relevant for the observed association, we queried each SNP in the results of tissue-specific eQTLs released by GTEx. For each SNP, we identified those genes (if any) for which the variant served as an eQTL (eGenes). We compared these eGenes to the annotation reported in the GWAS Catalog, re-prioritizing those that were previously annotated and identifying new genes as well (Table 10). For twelve loci, this eQTL-based reprioritization pointed to a single gene: rs932344 (ADTRP), rs72689147 (RP11-588K22.2 previously annotated as GUCY1A3), rs10139550 (HHIPL1), rs56336142 (KCNK5), rs2487928 (KIAA1462), rs1412444 (LIPA), rs55730499 (SLC22A3 previously annotated as SLC22A3-LPAL2-LPA), rs17087335 (RP11-738E22.3 previously annotated as REST-NOA1), rs11065979 (ALDH2 previously annotated as SH2B3), rs56062135 (SMAD3), rs10840293 (SWAP70), rs12202017 (TCF21).

Sixteen loci were associated with expression of more than one gene. These multi-gene eQTLs suggest that a single SNP may be important in regulating multiple genes in the locus, for example by chromatin looping. It is also possible that strong co-expression between genes may limit our ability to resolve the link between SNP and gene by eQTL associations.

The remaining GWAS loci (31) did not show any association with expression of nearby genes. Given that GWAS variants are expected to mark functional haplotypes rather than themselves being functional, there would be reduced power to detect a significant association between the GWAS variant and gene expression. Accordingly, we tested SNPs in moderate-strong linkage disequilibrium (R2 > 0.8, dprime > 0.8) with each of the 31 GWAS variants for their status as an eQTL, expanding the number of SNPs considered from 31 to 690. None of these variants were identified as eQTLs by GTEx, suggesting these haplotypes are indeed not associated with expression of nearby genes. These loci may have more subtle or timing-specific effects on local gene expression that remain undetectable in GTEx, may affect gene expression in trans (not measured by GTEx), or may mark haplotypes that alter disease risk without rendering significant changes in expression at the RNA level.
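The three-way split of GWAS loci described in this section (a single eGene, multiple eGenes, or none) amounts to a simple lookup, sketched below in Python. The function name and input structure are hypothetical; the first two SNP-to-gene pairs are taken from this chapter, and the third entry is a placeholder for a locus with no eQTL.

```python
def classify_by_egenes(egenes_by_snp):
    """Group SNPs by how many distinct eGenes they are an eQTL for."""
    groups = {"single": [], "multiple": [], "none": []}
    for snp, egenes in egenes_by_snp.items():
        n = len(set(egenes))
        key = "none" if n == 0 else "single" if n == 1 else "multiple"
        groups[key].append(snp)
    return groups

lookup = {
    "rs932344": ["ADTRP"],             # single eGene (reported in this chapter)
    "rs6689306": ["IL6R", "TDRD10"],   # multiple eGenes (reported in this chapter)
    "rs_placeholder": [],              # hypothetical SNP with no eGene
}

groups = classify_by_egenes(lookup)
```

For loci in the "none" group, the same lookup can be repeated for proxy SNPs in linkage disequilibrium with the GWAS variant, as was done for the 31 loci above.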

Identifying number of signals accounting for observed eQTLs

For each of the 12 genes where the eQTL-based reprioritization of GWAS loci pointed to a single gene, and where we therefore have greater confidence in the gene assignment, we find dozens of other eQTLs. To evaluate how the GWAS variant compares to other SNPs in the locus also associated with gene expression, and to determine whether there is any evidence for an additional functional haplotype in the locus, we plot the effect size (beta) for each eQTL against its linkage disequilibrium with the top scoring (most significant p-value) eQTL in a given tissue. Assuming one functional haplotype in the locus, the beta for each eQTL should correlate with its LD (R2) to the haplotype marked by the highest scoring SNP. Considering each candidate gene by tissue combination separately, we find the observed eQTLs for a gene cannot generally be explained by just one signal. ADTRP in whole blood, LIPA in transverse colon, and SWAP70 in spleen are notable exceptions: for each of these gene x tissue combinations there is evidence of only one functional haplotype marked by the GWAS variant (Figure 5). For the remaining gene x tissue combinations there is evidence for an additional signal contributing to gene expression (Figure 5, Figure 32, Table 1).
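The reasoning behind these plots can be sketched in Python. Under a single functional haplotype, |beta| should track R2 with the top eQTL; a SNP with a large |beta| despite low R2 is a candidate marker of an additional signal. The thresholds `r2_max` and `beta_fraction` and all values below are hypothetical, chosen only for illustration; in this work the number of signals was judged from the plots themselves.

```python
from math import sqrt

def pearson(x, y):
    """Pearson correlation between two equal-length numeric sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)

def flag_extra_signals(eqtls, r2_max=0.5, beta_fraction=0.8):
    """eqtls: list of (snp, abs_beta, r2_with_top_eqtl), top eQTL included (R2 = 1).

    Returns SNPs in weak LD with the top eQTL whose effect is nearly as large.
    """
    top_beta = max(eqtls, key=lambda e: e[2])[1]
    return [snp for snp, beta, r2 in eqtls
            if r2 <= r2_max and beta >= beta_fraction * top_beta]

# Hypothetical eQTLs for one gene x tissue combination.
toy_eqtls = [
    ("top_snp", 0.60, 1.00),
    ("snp_1", 0.58, 0.95),
    ("snp_2", 0.30, 0.50),
    ("snp_3", 0.10, 0.15),
    ("snp_4", 0.59, 0.45),  # strong effect despite weak LD with the top eQTL
]

single_signal = [e for e in toy_eqtls if e[0] != "snp_4"]
r_single = pearson([e[2] for e in single_signal], [e[1] for e in single_signal])
flagged = flag_extra_signals(toy_eqtls)
```

In the toy data, the four SNPs consistent with one haplotype show a near-perfect correlation between |beta| and R2, while the fifth breaks the pattern and is flagged.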

Figure 5. Number of distinct signals associated with eQTLs. (A) ADTRP (whole blood), LIPA (transverse colon), and SWAP70 (spleen): eQTLs accounted for by a single haplotype marked by the GWAS variant. (B) KCNK5 in subcutaneous adipose, LIPA in whole blood, and ALDH2 in sun exposed skin: linkage disequilibrium between eQTLs indicates multiple functional haplotypes. Correlation between absolute value of beta and LD (R2) with the top eQTL (most significant p-value) for all significant eQTLs in the given gene x tissue combination. GWAS variant is noted in red.

Table 1. Number of distinct signals associated with expression for genes implicated by GWAS and reprioritized by eQTL analysis. Number of independent signals was determined based on correlation plots between beta for association with gene expression and linkage disequilibrium (R2) with top scoring SNP (Figure 5, Figure 32).

| # | SNP | eGenes | Tissues | # Reported eQTLs | # Independent signals |
|---|---|---|---|---|---|
| 1 | rs932344 | ADTRP | Whole Blood | 3 | 1 |
| 2 | rs72689147 | RP11-588K22.2 | Nerve Tibial | 30 | > 1 |
| 3 | rs10139550 | HHIPL1 | Adipose Subcutaneous | 93 | > 1 |
|   |   |   | Cells Transformed fibroblasts | 61 | ? |
| 4 | rs56336142 | KCNK5 | Adipose Subcutaneous | 51 | > 1 |
| 5 | rs2487928 | KIAA1462 | Artery Aorta | 61 | ? |
| 6 | rs1412444 | LIPA | Adipose Subcutaneous | 28 | > 1 |
|   |   |   | Adrenal Gland | 15 | ? |
|   |   |   | Colon Transverse | 12 | 1 |
|   |   |   | Lung | 28 | > |
|   |   |   | Pancreas | 14 | ? |
|   |   |   | Spleen | 17 | > 1 |
|   |   |   | Thyroid | 22 | > 1 |
|   |   |   | Whole Blood | 44 | > 1 |
| 7 | rs55730499 | SLC22A3 | Skin Sun Exposed Lower leg | 469 | > 1 |
| 8 | rs17087335 | RP11-738E22.3 | Skin Sun Exposed Lower leg | 315 | > 1 |
| 9 | rs12202017 | TCF21 | Cells Transformed fibroblasts | 36 | ? |
|   |   |   | Muscle Skeletal | 27 | ? |
| 10 | rs11065979 | ALDH2 | Skin Sun Exposed Lower leg | 757 | > 1 |
| 11 | rs56062135 | SMAD3 | Thyroid | 24 | > 1 |
| 12 | rs10840293 | SWAP70 | Skin Not Sun Exposed | 24 | > 1 |
|   |   |   | Spleen | 7 | 1 |

As an example, we consider the number of signals needed to account for LIPA expression in whole blood. We find the most significant signal consists of a group of tightly linked SNPs marked by the GWAS variant. We find two additional clusters of SNPs with an equally robust beta (and p-value) that show relatively poor linkage disequilibrium with the cluster marked by the GWAS variant (R2 ~ 0.5) (Figure 5, Figure 33). These SNPs are more significant than predicted by their LD with the trait-associated variant and may mark additional functional haplotypes in the locus. To test the significance of the additional haplotypes, we used a separate dataset (CATHGEN) to evaluate whether including a marker from the additional haplotypes in a regression model improved the ability to account for LIPA expression. We found that including markers from the additional haplotypes improved the model, while adding a marker from the same haplotype did not (Table 2). This indicates there are at least three signals that contribute to LIPA expression in whole blood. There is no evidence for a role of these additional haplotypes in accounting for MI, suggesting the relationship between the GWAS variant, LIPA expression, and MI is not linear.
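The nested-model comparison used here (does adding one SNP to a model improve the fit?) can be illustrated with a likelihood ratio test for a single added parameter. The Python sketch below is not the R code used in this work; it assumes the log-likelihoods of the reduced and full models are already available, and the numeric values are hypothetical.

```python
from math import erfc, sqrt

def lrt_pvalue_df1(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns (statistic, p-value). For 1 degree of freedom the chi-square
    survival function equals erfc(sqrt(statistic / 2)).
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    if stat < 0:
        raise ValueError("full model should not fit worse than the reduced model")
    return stat, erfc(sqrt(stat / 2.0))

# Hypothetical log-likelihoods, e.g. for
#   reduced: XP ~ rs1412444 + sex + age
#   full:    XP ~ rs1412444 + rs2250781 + sex + age
stat, p = lrt_pvalue_df1(-100.0, -98.0)
```

A small p-value indicates that the added SNP carries information beyond the SNP already in the model, which is the pattern seen in Table 2 when a marker from a different haplotype is added.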


Figure 6. Three distinct signals associated with LIPA eQTLs in whole blood. Correlation between absolute value of beta (A, B) or p-value (C) and LD (R2) with the top eQTL (most significant p-value) for all significant eQTLs. In the first plot from the left, GWAS variant is noted in red. The next two plots show tightly linked SNPs (dprime > 0.9; R2 > 0.9) in the same color, forming three separate haplotypes: red (includes GWAS variant), blue, and green.

Table 2. Additional haplotype for LIPA expression in whole blood improves model for gene expression but not MI. Analysis of variance (ANOVA) comparing the ability of models with different SNP combinations to account for gene expression (XP; Illumina probe: ILMN_1718063) and myocardial infarction (MI).

| Variable of interest | ANOVA p value | Model 1 | Model 2 |
|---|---|---|---|
| rs1412444 (GWAS) | 2.2e-16 | XP ~ sex + age | XP ~ rs1412444 + sex + age |
| rs13332328 (GWAS) | 2.2e-16 | XP ~ sex + age | XP ~ rs13332328 + sex + age |
| rs1051338 (Additional-1) | 2.2e-16 | XP ~ sex + age | XP ~ rs1051338 + sex + age |
| rs2250781 (Additional-2) | 2.2e-16 | XP ~ sex + age | XP ~ rs2250781 + sex + age |
| rs1412444 in context of rs13332328 | 0.70 | XP ~ rs1412444 + sex + age | XP ~ rs1412444 + rs13332328 + sex + age |
| rs1412444 in context of rs1051338 | 0.08 | XP ~ rs1412444 + sex + age | XP ~ rs1412444 + rs1051338 + sex + age |
| rs1412444 in context of rs2250781 | 0.014 | XP ~ rs1412444 + sex + age | XP ~ rs1412444 + rs2250781 + sex + age |
| rs1412444 & rs2250781 in context of rs1051338 | 0.096 | XP ~ rs1412444 + rs2250781 + sex + age | XP ~ rs1412444 + rs2250781 + rs1051338 + sex + age |
| rs1412444 & rs1051338 in context of rs2250781 | 0.017 | XP ~ rs1412444 + rs1051338 + sex + age | XP ~ rs1412444 + rs1051338 + rs2250781 + sex + age |
| rs1412444 (GWAS) | 0.0002638 | MI ~ sex + age | MI ~ rs1412444 + sex + age |
| rs13332328 (GWAS) | 0.0002684 | MI ~ sex + age | MI ~ rs13332328 + sex + age |
| rs1051338 (Additional) | 0.0001477 | MI ~ sex + age | MI ~ rs1051338 + sex + age |
| rs2250781 (Additional-2) | 0.0002987 | MI ~ sex + age | MI ~ rs2250781 + sex + age |
| rs1412444 in context of rs13332328 | 0.61 | MI ~ rs1412444 + sex + age | MI ~ rs1412444 + rs13332328 + sex + age |
| rs1412444 in context of rs1051338 | 0.37 | MI ~ rs1412444 + sex + age | MI ~ rs1412444 + rs1051338 + sex + age |
| rs1412444 in context of rs2250781 | 0.93 | MI ~ rs1412444 + sex + age | MI ~ rs1412444 + rs2250781 + sex + age |
| rs1412444 & rs2250781 in context of rs1051338 | 0.37 | MI ~ rs1412444 + rs2250781 + sex + age | MI ~ rs1412444 + rs2250781 + rs1051338 + sex + age |
| rs1412444 & rs1051338 in context of rs2250781 | 0.93 | MI ~ rs1412444 + rs1051338 + sex + age | MI ~ rs1412444 + rs1051338 + rs2250781 + sex + age |

Tissue Specific eQTLs

To consider how eQTLs for a given gene compare across different tissue types, we cluster and plot eQTLs by their linkage disequilibrium (R2), using a colored bar shaded by the significance of the eQTL (p-value) at the top of the heatmap to denote tissue type (Figure 7). We find SNPs generally cluster by tissue, suggesting distinct functional haplotypes in different tissues and emphasizing the tissue-specific nature of gene regulation. Of note, the haplotype detected by GWA studies often appears as a significant eQTL in only a handful of tissues, some of which fit with our understanding of CAD pathology and others that suggest as yet undiscovered biology.


Figure 7. Linkage disequilibrium between LIPA eQTLs in all tissues reveals multiple distinct haplotypes. eQTLs are clustered by their linkage disequilibrium (R2). Colored bars at the top of the plot reflect the strength of the eQTL in each tissue type, with more significant p-values denoted by a darker color.

The presence of multiple haplotypes associated with expression in different tissues serves as a reminder that the pairing need not be one SNP to one gene. Rather, multiple SNPs can direct expression of a single gene and a single SNP can direct expression of multiple genes, in whatever combination across tissues proves to be the most advantageous to what is likely a diverse array of pressures.

Transcript Specific eQTLs

Given the complexity of these loci, which often harbor multiple transcripts, we asked whether there was evidence for transcript-specific eQTLs. None of the GWAS variants are associated with splicing by either of the two algorithms considered by the GTEx consortium: sQTLSeekeR splicing QTLs (splicing isoform ratio QTLs) and Altrans splicing QTLs (splice junction QTLs). This indicates that these particular variants do not associate with specific splice isoforms, but does not inform our understanding of genetic variants throughout the locus or other types of isoforms.

To consider the question of transcript-specific eQTLs more broadly, for a 1 Mb window around the published eQTL range for the gene, we calculated the association between each SNP and expression of both the gene and individual transcripts. Plotted here is a local Manhattan plot for the LIPA gene and its transcripts in whole blood (Figure 8). Although the patterns are overall quite similar between the gene and individual transcripts, we find divergent SNPs that suggest the presence of transcript-specific eQTLs.


Figure 8. Local Manhattan plot for LIPA in whole blood suggests presence of transcript-specific eQTLs. Position is denoted on the x-axis with the black bar indicating the gene start and stop (note: this varies by transcript, but an approximation for the gene is depicted here for clarity). P-value for association with expression is denoted by the y-axis. Each transcript is marked by a different color (see legend).

Variance eQTLs

eQTL analysis considers the association between a given genotype and expression level of a gene. A complementary approach for identifying functional genetic variants in the locus is to consider those SNPs that are associated not with the absolute level of gene expression, but rather with its variability. Variability has important implications for evolution, as natural selection acts on what is available and more variability in a trait provides more possibilities, such that genetic variants that increase the variability of a trait may be more exposed to selection. It has also been suggested that large variability may indicate the presence of an interaction and that variance can be used to guide the search for epistasis (Rönnegård et al., 2012).

To test for SNPs associated with variability in expression, we fit a mixture model to normalized expression values (i.e. we used the residuals from a generalized linear model; see Methods). The output of the mixture model includes both the number of components or ‘states’ of gene expression, with the possibility of identifying only a single component, and the probability of each individual value falling into the different components. For those transcripts where we identified multiple mixture components (of reasonable size), suggesting the presence of two or more ‘states’ of expression, we assessed the mean and variance of each component, being particularly interested in those cases where there was a substantial difference in variance between components. We tested for SNPs associated with the posterior probability of being assigned to one component versus another. Using this approach, we find several examples where a SNP is associated with the posterior probability of being in a mixture component with higher or lower variance, i.e. examples where a SNP is associated with variability in expression.
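To illustrate what the mixture model produces, the sketch below fits a one-dimensional, two-component Gaussian mixture by expectation-maximization in plain Python. It is a simplified stand-in and not the method used in this work: the number of components is fixed at two rather than selected by the model, the initialization (lower and upper halves of the sorted data) is a crude assumption, and the input values are hypothetical.

```python
from math import exp, pi, sqrt

def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return exp(-((x - mean) ** 2) / (2.0 * var)) / sqrt(2.0 * pi * var)

def fit_two_component_mixture(values, n_iter=200):
    """EM for a 1-D, two-component Gaussian mixture with unequal variances.

    Returns (weights, means, variances, posteriors), where posteriors[i] is the
    posterior probability that values[i] belongs to component 0.
    """
    ordered = sorted(values)
    half = len(ordered) // 2
    groups = [ordered[:half], ordered[half:]]  # crude initial split
    weights = [0.5, 0.5]
    means = [sum(g) / len(g) for g in groups]
    variances = [max(sum((x - m) ** 2 for x in g) / len(g), 1e-6)
                 for g, m in zip(groups, means)]
    post = []
    for _ in range(n_iter):
        # E-step: posterior probability of component 0 for every observation.
        post = []
        for x in values:
            p0 = weights[0] * normal_pdf(x, means[0], variances[0])
            p1 = weights[1] * normal_pdf(x, means[1], variances[1])
            post.append(p0 / (p0 + p1))
        # M-step: update weight, mean, and variance of each component.
        for k, resp in enumerate((post, [1.0 - p for p in post])):
            total = sum(resp)
            weights[k] = total / len(values)
            means[k] = sum(r * x for r, x in zip(resp, values)) / total
            variances[k] = max(
                sum(r * (x - means[k]) ** 2 for r, x in zip(resp, values)) / total,
                1e-6,
            )
    return weights, means, variances, post

# Hypothetical normalized expression values: a tight group and a dispersed group.
values = [-0.2, -0.1, 0.0, 0.1, 0.2, 8.0, 9.0, 10.0, 11.0, 12.0]
weights, means, variances, post = fit_two_component_mixture(values)
```

The posterior probabilities in `post` play the role of the per-individual component probabilities described above and could then be tested against genotype. The toy data separate the components by mean for clarity; components with similar means but different variances are harder to recover and depend more strongly on initialization, and this sketch has no guard against a component collapsing or against observations far from both components.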

To filter for a biologically meaningful interaction, we use this structure to test for a particular scenario in which there is an interaction between a SNP and a transcription factor predicted for the given gene. The hypothesis is that large variation in gene expression would be observed only when the transcription factor is expressed and the SNP that alters its efficacy is also present. Accordingly, we tested for a difference in correlation between the expression level of the transcription factor and the posterior probability of being assigned to a mixture component in the presence or absence of the SNP. Here we consider HHIPL1, one of the genes that was reprioritized, as an example. A mixture model for expression of the primary transcript (HHIPL-1; ENST00000330710) reveals two components of approximately equal size (56% of data; 44% of data) with similar means (0.1 RPKM; 0.15 RPKM) but different variances (0.000824; 0.00345). Furthermore, we find a SNP downstream of HHIPL1, rs2273840, in CYP46A1 that is associated with the probability of being in one mixture component versus the other (p = 2.57 x 10^-6). When we consider the correlation between expression of a predicted transcription factor for HHIPL1, YY1, and the posterior probability of falling into the first versus the second mixture component, we find a significant association (p = 0.05) when the SNP is present and a non-significant association (p = 0.43) when the SNP is absent. This suggests that a SNP downstream of HHIPL1 modifies the relationship between YY1 and HHIPL1 so that when the SNP is present it increases the variability in expression of HHIPL1. This serves as proof of concept both for the existence of SNPs associated with variability in gene expression and for the involvement of transcription factors in mediating this process.
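The stratified comparison described above can be sketched in Python by computing Kendall's tau between transcription factor expression and the posterior probability separately in carriers and non-carriers of the SNP. The sketch computes only the tau-a coefficient, without ties handling or the p-value reported in this work, and all values are hypothetical.

```python
from itertools import combinations

def kendall_tau_a(x, y):
    """Kendall's tau-a: (concordant - discordant) / number of pairs."""
    concordant = discordant = 0
    for (x1, y1), (x2, y2) in combinations(zip(x, y), 2):
        s = (x1 - x2) * (y1 - y2)
        if s > 0:
            concordant += 1
        elif s < 0:
            discordant += 1
    n_pairs = len(x) * (len(x) - 1) // 2
    return (concordant - discordant) / n_pairs

def tau_by_carrier_status(tf_expr, posterior, carrier):
    """Return {True: tau among carriers, False: tau among non-carriers}."""
    out = {}
    for status in (True, False):
        idx = [i for i, c in enumerate(carrier) if c == status]
        out[status] = kendall_tau_a([tf_expr[i] for i in idx],
                                    [posterior[i] for i in idx])
    return out

# Hypothetical data: TF expression, posterior probability of the
# high-variance component, and SNP carrier status for eight individuals.
tf_expr = [1, 2, 3, 4, 1, 2, 3, 4]
posterior = [0.1, 0.3, 0.6, 0.9, 0.5, 0.2, 0.6, 0.3]
carrier = [True, True, True, True, False, False, False, False]

taus = tau_by_carrier_status(tf_expr, posterior, carrier)
```

In the toy data the relationship is perfectly monotone among carriers and absent among non-carriers, which is the qualitative pattern reported for YY1 and HHIPL1.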

Discussion

We find that the genomic loci implicated in CAD by GWAS often harbor multiple protein-coding and non-coding transcripts. eQTL-based reprioritization indicates the relevant gene may in some cases be a non-coding RNA in the locus. Both rs72689147 (RP11-588K22.2 previously annotated as GUCY1A3) and rs17087335 (RP11-738E22.3 previously annotated as REST-NOA1) are eQTLs for a single non-coding RNA in the locus rather than the annotated protein-coding gene.

Those 16 GWAS variants that serve as eQTLs for multiple genes in the locus emphasize that a single SNP can regulate multiple genes and need not have just one target. For example, rs6689306 is an eQTL for both IL6R and TDRD10. It falls in a DNase hypersensitivity site where it exhibits allelic expression imbalance, suggesting the SNP is linked with openness of the chromatin in this locus and by this means alters expression of multiple genes.

None of the GWAS hits are themselves associated with particular splice isoforms, but we do find evidence for transcript specific eQTLs. Given the sheer number of transcripts in these loci and the potential for transcripts with highly specialized functions relevant to disease, assessing transcript specific eQTLs is a prudent next step in interpreting GWAS - perhaps only for candidate genes as many datasets are underpowered for such an analysis on a genome-wide scale.

Apart from a handful of examples, tissue-specific expression patterns rarely reveal the relevant tissue or gene, even when the GWAS variant is itself an eQTL. All available information suggests that uncovering the mechanism for the observed trait association will likely require targeted molecular studies. The distribution of PubMed articles and GO IDs across these candidate genes suggests that this targeted work has touched on only a handful of genes thus far.

Incorporating large-scale data on gene expression can inform our understanding of GWAS-based associations. Limitations certainly exist, not least that results are limited by the datasets available and subject to all the biases that come with association studies. For example, Braenne et al. previously reported that the CAD GWAS hit rs2895811, located in an intron of HHIPL1, was related to YY1 rather than HHIPL1. However, here we find rs2895811 associated with expression of HHIPL1 in both subcutaneous adipose and transformed fibroblasts and no evidence for an association with expression of YY1. We do, however, find that variability of HHIPL1 expression is associated with YY1 expression in the presence of a different SNP downstream of HHIPL1. These two genes likely share a functional relationship modulated to some extent by genetic variants, which may contribute to the signal detected by GWAS.

We find that the eQTLs for a given gene x tissue combination can rarely be explained by a single functional haplotype; rather, there are multiple signals that contribute to expression of a gene. Deciphering how these function in combination will likely improve our ability to account for the heritability of disease.

Conclusions

Interpreting GWAS-based candidate SNPs requires an approach that considers tissue specific effects of SNPs on multiple protein coding and non-coding transcripts in the locus. For many genes, there is evidence for multiple functional haplotypes directing expression.

Methods

Data

CATHeterization GENetics (CATHGEN)

Expression data, genotypes, and clinical phenotypes were acquired from CATHeterization GENetics (CATHGEN) via dbGaP Project #5358 (dbGaP accession number phs000703 on 25 March 2015). Expression levels had been determined using Illumina HumanHT-12 v3 in RNA from whole blood. We considered age (phv00197199), gender (phv00197207), and race (phv00197206), all recorded in pht003672, as additional covariates in our models. Myocardial infarction (MI), defined by a previous recorded history of MI (phv00197212) or a non-zero CAD index (phv00197202) recorded in pht003672, was used as a phenotype/outcome variable. Our access to this study was approved by the Ohio State University IRB (Protocol #2013H0096).

Genotype and Tissue Expression Project (GTEx)

Tissue-specific RNAsequencing data were acquired from the Genotype and Tissue Expression Project (GTEx) via dbGaP Project #5358 (dbGaP accession number phs000424 in April 2014). RNAsequencing data had been generated using poly-adenylated priming with reads aligned to HG19, Gencode v12. For further details see Lonsdale et al. and the GTEx website (http://www.gtexportal.org/home/documentationPage) (Lonsdale et al., 2013). We used RPKM as the measure of expression.

Gene Information

We obtained transcripts, coding status, and GO IDs for each gene with the aid of the ‘biomaRt’ package implemented in R. Non-coding transcripts were defined as those with Ensembl reported biotypes: 3prime overlapping ncRNA, antisense, lincRNA, miRNA, misc RNA, nonsense mediated decay, polymorphic pseudogene, processed pseudogene, processed transcript, sense intronic, sense overlapping, snoRNA, snRNA, transcribed processed pseudogene, transcribed unprocessed pseudogene, unprocessed pseudogene. Protein-coding transcripts were defined as: IG V gene, protein coding, retained intron. Number of PubMed articles was obtained with the ‘NCBI2R’ package.

Linkage Disequilibrium

Linkage disequilibrium was calculated using ‘ld’ function in the ‘snpStats’ package implemented in R based on data from 1000 genomes CEU population.
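For readers unfamiliar with the R2 measure, the Python sketch below computes it as the squared Pearson correlation between two SNPs coded as 0/1/2 allele counts. This genotype-based estimate is a common approximation and is not a reimplementation of the ‘ld’ function; the genotype vectors are hypothetical.

```python
def ld_r2(dosage_a, dosage_b):
    """Squared Pearson correlation between two SNPs coded as 0/1/2 allele counts."""
    n = len(dosage_a)
    if n != len(dosage_b) or n < 2:
        raise ValueError("need two equal-length genotype vectors")
    ma, mb = sum(dosage_a) / n, sum(dosage_b) / n
    sab = sum((a - ma) * (b - mb) for a, b in zip(dosage_a, dosage_b))
    saa = sum((a - ma) ** 2 for a in dosage_a)
    sbb = sum((b - mb) ** 2 for b in dosage_b)
    if saa == 0 or sbb == 0:
        raise ValueError("R2 is undefined for a monomorphic SNP")
    return (sab * sab) / (saa * sbb)

# Hypothetical genotypes for six individuals.
snp_a = [0, 1, 2, 0, 1, 2]
snp_b = [0, 1, 2, 2, 1, 0]

r2_self = ld_r2(snp_a, snp_a)   # a SNP is in complete LD with itself
r2_ab = ld_r2(snp_a, snp_b)     # uncorrelated in this toy sample
```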

Testing clinical significance of identified variants

Additive logistic models were compared using ANOVA with a likelihood ratio test (LRT) implemented in R. Age and gender were used as covariates. The dataset was restricted to only those of Caucasian race. GLM p-values are reported for the standard Wald test, and ANOVA p-values are reported for the LRT (unless otherwise stated).

Variance eQTLs

Gene expression was normalized by taking the residuals from a GLM in which gene expression was explained by covariates (PEER factors calculated by GTEx, age, gender) selected using the cv.glmnet function in the ‘glmnet’ package implemented in R. A Gaussian mixture model was fit using the ‘mclust’ package. Association between SNPs and posterior probabilities of being assigned to one mixture component versus the other was calculated using logistic regression. The p-value is reported. We used ConSite to determine likely transcription factors for each gene. Association between expression of a transcription factor and the posterior probability of being assigned to one mixture component versus the other was calculated using Kendall’s rank correlation test. The p-value is reported.


Chapter 3: Non-linear Interactions Between Candidate Genes of Myocardial Infarction Revealed in mRNA Expression Profiles

Background

Large-scale GWAS have revealed numerous candidate risk alleles for complex disorders, such as MI and CAD (Deloukas et al., 2013; Roberts, 2014). However, genetic risk of disease at the population level cannot be accounted for by individual genetic variants in single genes or even by summing the individual effects of dozens of genes, a gap often referred to as the ‘missing heritability’ (Maher, 2008; Sadee et al., 2014). Thus far, epistasis has also failed to account for heritability (Lippert et al., 2013; Moore & Williams, 2005). Other factors may involve additional loci not detectable by GWAS, non-additive interactions between candidate genes, and external factors modulating gene expression and interactions (Schadt, 2009; Senner, Conklin, & Piersma, 2015). Instead of restricting the analysis to genetic variants, we focus on RNA expression levels that integrate multiple factors including genetic differences influencing expression, gene-gene interactions, complex feedback mechanisms, and environmental influence.

Previous studies of differential mRNA expression in MI and related phenotypes have identified ~1,500 individual genes, but fewer than 5% replicate across more than one study (Aziz, Zaas, & Ginsburg, 2007; Hiltunen et al., 2002; Joehanes et al., 2013; Nanni, Romualdi, Maseri, & Lanfranchi, 2006; Randi et al., 2003; Seo et al., 2004; Sinnaeve et al., 2009; Wingrove et al., 2008). Poor replication is often attributed to differences in cohorts and methodology between studies, but also reflects the fact that disease-relevant signals are dispersed across many interacting genes (i.e. a network) and that gene expression varies greatly across individuals with different genetic and environmental backgrounds. Low reproducibility of differentially expressed genes limits the biological interpretation and utility of these findings.

Using cluster analysis, GO enrichment analysis, and a variety of machine-learning approaches, studies of differential expression in MI have also implicated particular pathways (e.g. caspase cascade, apoptosis signaling) and cell types (e.g. CD71+ erythroid, NK cells) (Aziz et al., 2007; Hiltunen et al., 2002; Joehanes et al., 2013; Nanni et al., 2006; Randi et al., 2003; Seo et al., 2004; Sinnaeve et al., 2009; Wingrove et al., 2008). Such efforts, however, have not addressed how non-linear interactions between these genes and pathways contribute to MI. This is an important question, as complex diseases likely do not arise from single perturbations but rather from a dysregulation of the molecular network. Indeed, Wu et al. find pairwise non-linear interactions to be important for disease classification and biomarker development in breast cancer (M.-Y. Wu et al., 2016).

Focusing on myocardial infarction (MI), we select a well-defined, small number of GWAS-derived candidate genes to probe mechanisms inaccessible on a genome-wide scale. Using whole blood expression arrays from the CATHeterization GENetics Study (CATHGEN) and the Framingham Heart Study (FHS), we test the association of RNA expression profiles with MI. We focus first on individual genes and then expand to consider non-linear interaction terms between pairs of candidate genes.

Analysis of differential co-expression between RNAs can enhance our understanding of how dynamic feedback mechanisms between pairs and defined networks of mRNAs determine disease risk (Gaiteri, Ding, French, Tseng, & Sibille, 2014; Hao et al., 2015; Stergachis et al., 2014; Zhao, Luo, Chen, Xiao, & Chen, 2014).

Gene networks can be extracted from existing databases (e.g., Ingenuity Pathway Analysis, KEGG). However, most databases are generated by mining previously published literature; they are thus biased towards the most-studied pathways and often neglect tissue specificity and other nuances of gene-gene interactions. Tissue-specific co-expression networks allow us to define relationships between genes in disease relevant tissues.

Despite progress in the field of network biology, existing methods do not fully account for variability in co-expression across individuals with different genetic and environmental backgrounds, even in those cases where the underlying methodology is tailored to detect non-linear dependency patterns. To overcome this limitation, we employ a resampling procedure that generates a quantitative measure of the stability of co-expression across individuals.

To inform the understanding of non-linear pairwise interactions associated with MI, we construct small-scale, tissue-specific co-expression networks with candidate genes in healthy individuals using data from blood and artery in GTEx. Blood supports biomarker discovery, and artery, as the site of atherosclerotic plaque formation characteristic of CAD, captures disease-relevant physiology. We hypothesize that the relevance of a non-linear interaction for disease will be reflected by a difference in the strength or variability of co-expression between cases and controls, rather than a binary (presence-absence) switch in co-expression. Accordingly, we expect connections between genes in the co-expression network to be the same in both diseased and healthy individuals but their quality to change between those with and without MI.

Non-linear interactions between candidate genes associated with MI are unlikely to represent direct, physical interactions between mRNAs but rather distinct biological processes that are coordinately regulated. Thus, we expand co-expression networks beyond candidate genes to identify additional mRNAs that may serve to mediate the observed interactions.

Results

Differentially expressed individual candidate genes and their interactions in myocardial infarction (MI)

We analyzed RNA profiles (measured by expression arrays in blood) from CATHGEN and the Framingham Heart Study in subjects with and without a history of MI. Focusing on established candidate genes, we searched for (1) individual RNAs and (2) interactions between RNAs significantly associated with MI status, noting those that replicate between the two cohorts.

GWAS based candidate genes in MI

The CAD Genome Wide Association Study performed by the CARDIOGRAMplusC4D consortium published in 2014, including >60,000 cases and 130,000 controls, identified 45 loci associated with MI and CAD at genome-wide significance (Table 12) (Deloukas et al., 2013; Roberts, 2014). Forty-one of the candidate genes assigned to these loci had one or more corresponding probes on the expression arrays used in both CATHGEN (89 probes) and the Framingham Heart Study (41 probes). All probes corresponding to these genes were tested for an association between expression and MI using logistic regression.

Expression of individual candidate genes in MI

The approach for detecting differentially expressed RNAs was designed for each study separately as CATHGEN and FHS include different proportions of cases/controls, levels of relatedness, and racial diversity.

CATHGEN. We measured the association between expression of each individual probe, assigned to the 41 candidate genes, and MI status using logistic regression with age, race, and gender as additional covariates in the model (n = 1250; 359 cases and 891 controls).

We identified 14 candidate genes with at least one probe displaying expression levels nominally (p < 0.05) associated with MI: PEMT, RAI1, LPAL2, PDGFD, FES, ZC3HC1, PHACTR1, CNNM2, GUCY1A3, UBE2Z, MRAS, FURIN, IL6R, and MIA3 (Figure 9, Table 12).

Framingham Heart Study. Because FHS is a population-based cohort, the prevalence of MI was much lower than in CATHGEN; accordingly, we used a matched case-control design. MI cases with expression data available (n = 193) were matched to controls (selected from a pool of 4,952 subjects) on age and gender, with each control drawn from a different family than its matched case. To assess the robustness of the association between mRNA levels and MI status, conditional logistic regression analysis was performed 5000 times, each time on a different random set of matched controls. We identified 8 candidate genes displaying expression levels nominally (p < 0.05) associated with MI in half or more of the 5000 resamples (i.e. the sample median of the p-values was less than 0.05): TRIB1, VAMP8, FES, PHACTR1, ZEB2, NT5C2, FLT1, and SMG6 (Figure 9, Table 12), with PHACTR1 and FES replicating from CATHGEN.
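The resampling scheme described above can be illustrated with a minimal sketch. It is not the code used in this study. It assumes 1:1 matching, a single expression covariate, and simulated data, and it draws controls independently for each case without the different-family constraint. Under 1:1 matching the conditional likelihood reduces to an intercept-free logistic model on the case-minus-control differences, which is what the hypothetical function clogit_1to1 fits. The names resample_pvalues, strata, pool, and cases are likewise illustrative.

```python
import math
import random
import statistics

def clogit_1to1(diffs, iters=25):
    """Conditional logistic fit for 1:1 matched pairs and one covariate.

    diffs holds (case value - matched control value). Returns the
    coefficient and a two-sided Wald p-value.
    """
    beta = 0.0
    info = 0.0
    for _ in range(iters):
        score = 0.0
        info = 0.0
        for d in diffs:
            p = 1.0 / (1.0 + math.exp(-beta * d))
            score += d * (1.0 - p)          # derivative of the log-likelihood
            info += d * d * p * (1.0 - p)   # observed information
        beta += score / info                # Newton-Raphson update
    z = beta * math.sqrt(info)              # info is from the last Newton step
    return beta, math.erfc(abs(z) / math.sqrt(2.0))

def resample_pvalues(cases, pool, n_resamples, rng):
    """Redraw one matched control per case and refit, n_resamples times.

    cases and pool are lists of (expression, stratum) tuples; a stratum
    stands in for the matching variables (here an age decade and sex).
    """
    by_stratum = {}
    for expr, stratum in pool:
        by_stratum.setdefault(stratum, []).append(expr)
    pvals = []
    for _ in range(n_resamples):
        diffs = [expr - rng.choice(by_stratum[stratum]) for expr, stratum in cases]
        pvals.append(clogit_1to1(diffs)[1])
    return pvals

# Simulated example: cases have expression shifted by 0.5 SD (an assumption).
rng = random.Random(1)
strata = [(decade, sex) for decade in (40, 50, 60, 70) for sex in ("F", "M")]
pool = [(rng.gauss(0.0, 1.0), rng.choice(strata)) for _ in range(2000)]
cases = [(rng.gauss(0.5, 1.0), rng.choice(strata)) for _ in range(193)]
pvals = resample_pvalues(cases, pool, 200, rng)
median_p = statistics.median(pvals)  # gene is called associated if below 0.05
```

The sketch uses 200 resamples to keep the run time short; the analysis in this chapter used 5000.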


Figure 9. Differentially expressed candidate genes in MI. A. CATHGEN. P-value of association between expression of probe ID (labeled by assigned gene) and MI status from logistic regression with age, race, and gender as additional covariates. Genes with p-value less than 0.05 are colored in red. B. Framingham Heart Study. Number of resamples (of total 5000) in which mRNA expression of GWAS-based candidate gene was significantly associated with MI (p < 0.05) in conditional logistic regression model. Controls were matched to cases on age, gender, and belonging to a different family. TRIB1, VAMP8, FES, PHACTR1, ZEB2, NT5C2, and SMG6 (colored in red) had expression profiles strongly associated with MI (i.e. median p-value among bootstrap replicates < 0.05). C. Histograms of p-values for genes determined as significant in CATHGEN that did not meet the criteria for being associated with MI in FHS. FURIN, IL6R, RAI1, and UBE2Z were informative of MI based on right-skewed histogram. D. Venn diagram displaying overlap between genes individually significant in CATHGEN and the Framingham Heart Study.

Our resampling approach results in a distribution of main effects and p-values that reflects the biological and environmental diversity of cases and controls. We evaluated these distributions in FHS for all genes significant in CATHGEN. Probes in FURIN, IL6R, RAI1, and UBE2Z showed right-skewed p-value distributions and were therefore considered to be informative of MI (Figure 9). With approximately uniformly distributed p-values, probes in CNNM2, GUCY1A3, MRAS, MIA3, PDGFD, PEMT, and LPAL2 did not show evidence of differential expression with MI in FHS (Figure 10). Uncertainty in correctly assigning differentially expressed genes indicates that single mRNA expression profiles in peripheral blood are only moderately informative of MI status but can still serve as a basis for further analysis.
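The distinction between a right-skewed and an approximately uniform distribution of resampled p-values was made here from histograms. One way to express it numerically is sketched below, as an illustration rather than the procedure used in this study. The function names and the two example vectors are assumptions. Because the resamples share the same cases, the p-values are not independent, so the statistics are descriptive only.

```python
def ks_distance_from_uniform(pvals):
    """Largest gap between the empirical CDF of pvals and the Uniform(0,1) CDF."""
    xs = sorted(pvals)
    n = len(xs)
    return max(max((i + 1) / n - x, x - i / n) for i, x in enumerate(xs))

def fraction_below(pvals, alpha=0.05):
    """Share of resamples with p < alpha; about alpha if p-values are uniform."""
    return sum(p < alpha for p in pvals) / len(pvals)

# Two artificial examples: an evenly spread vector and one piled up near zero.
uniform_like = [(i + 0.5) / 100 for i in range(100)]
right_skewed = [((i + 0.5) / 100) ** 3 for i in range(100)]

d_uniform = ks_distance_from_uniform(uniform_like)
d_skewed = ks_distance_from_uniform(right_skewed)
```

A small distance together with roughly 5% of p-values below 0.05 corresponds to the uniform case. A large distance with an excess of small p-values corresponds to the right-skewed case.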


Figure 10. Differentially expressed candidate gene pairs in MI. Nodes are genes and edges reflect significant non-linear interactions defined by logistic models (see methods). Pink nodes indicate genes that are individually significant (see Figure 9), red lines indicate gene pairs that are co-expressed (see Figure 4), dotted lines indicate gene pairs that were not tested for co-expression due to poor expression of one or more of the genes, bolded edges indicate non-linear interactions identified in both cohorts (CNNM2|GUCY1A3; CNNM2|ZEB2). A. CATHGEN. Connected graph formed by pairs of genes with a significant interaction term in CATHGEN using a logistic model (logit(MI) ~ RNA1 + RNA2 + RNA1*RNA2 + additional covariates). B. Framingham Heart Study. Disconnected graphs formed by pairs of genes exhibiting statistically significant interaction terms in FHS, defined by means of the bootstrapped conditional logistic regression (logit(MI) ~ RNA1 + RNA2 + RNA1*RNA2) with 5000 repetitions and matching procedure the same as for individual genes.

Pairs of interacting candidate genes

Expression of candidate gene pairs in MI

In addition to searching for individual differentially expressed genes, we considered pairs of mRNAs with and without an interaction term in the model. The expression profiles of mRNA pairs were evaluated for association with MI, using the same approach applied to individual mRNA expression levels in CATHGEN and in Framingham.

CATHGEN. An additive logistic model (logit(MI) ~ RNA1 + RNA2 + additional covariates) revealed 56 (of 3,916) pairs of probes with both RNA terms significant (p < 0.05) and 1,064 with only one RNA term significant in the model (for details of the model, see Methods section). These 1,120 pairs of probes include only 20 of the 40 candidate genes. Applying the same criterion of significance to an interactive model (logit(MI) ~ RNA1 + RNA2 + RNA1*RNA2 + additional covariates) revealed 53 (of 3,916) pairs of probes with both RNA terms significant (p < 0.05) and 624 with only one RNA term significant. These 677 pairs include 38 of the 40 candidate genes. Thus, the interaction model yields fewer pairs of RNAs, but these pairs comprise a more diverse set of genes (Table 3).

Table 3. Comparison of additive models with and without non-linear interactions between candidate gene pairs in CATHGEN. Number of significant pairs, defined as having one or both RNAs associated with MI (p < 0.05), and number of candidate genes forming those pairs. Shown for an additive versus interactive model. Additive model: MI as explained by main effects of candidate genes (logit(MI) ~ RNA1 + RNA2 + additional covariates). Interactive model: MI as explained by main effects of candidate genes and an interaction term (logit(MI) ~ RNA1 + RNA2 + RNA1*RNA2 + additional covariates).

                             Additive Model    Interactive Model
Number of pairs              1120              677
Number of candidate genes    20                38

We define a pair of RNAs to be interacting non-linearly if we detect a statistically significant (p < 0.05) interaction term in the logistic model for this pair (regardless of the significance of individual RNA terms as evaluated above). By this definition, we found 167 (of 3,916) pairs of probes defined as interacting in CATHGEN. These probe pairs represent 149 gene pairs that include nearly all of the candidate genes (40 of 41).
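As a minimal sketch of this definition, the interactive model can be fitted by Newton-Raphson and the interaction term tested with a Wald test. This is not the code used in this study. The data are simulated, the additional covariates (age, race, gender) are omitted, and the helper names solve, score_and_information, fit_logistic, and simulate are illustrative. For reference, 3,916 is the number of unordered pairs among the 89 CATHGEN probes (89 × 88 / 2).

```python
import math
import random

def solve(a, b):
    """Solve the linear system a x = b by Gauss-Jordan elimination."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                m[r] = [u - f * v for u, v in zip(m[r], m[col])]
    return [m[i][n] / m[i][i] for i in range(n)]

def score_and_information(X, y, beta):
    """Gradient of the log-likelihood and the Fisher information matrix."""
    k = len(beta)
    grad = [0.0] * k
    info = [[0.0] * k for _ in range(k)]
    for xi, yi in zip(X, y):
        p = 1.0 / (1.0 + math.exp(-sum(b * v for b, v in zip(beta, xi))))
        for a in range(k):
            grad[a] += (yi - p) * xi[a]
            for c in range(k):
                info[a][c] += p * (1.0 - p) * xi[a] * xi[c]
    return grad, info

def fit_logistic(X, y, iters=15):
    """Newton-Raphson fit; returns (coefficients, two-sided Wald p-values)."""
    k = len(X[0])
    beta = [0.0] * k
    for _ in range(iters):
        grad, info = score_and_information(X, y, beta)
        beta = [b + s for b, s in zip(beta, solve(info, grad))]
    _, info = score_and_information(X, y, beta)
    pvals = []
    for j in range(k):
        unit = [1.0 if i == j else 0.0 for i in range(k)]
        variance = solve(info, unit)[j]  # j-th diagonal of the inverse
        z = beta[j] / math.sqrt(variance)
        pvals.append(math.erfc(abs(z) / math.sqrt(2.0)))
    return beta, pvals

def simulate(n, rng):
    """Toy data with a true interaction coefficient of 0.8 (an assumption)."""
    X, y = [], []
    for _ in range(n):
        r1, r2 = rng.gauss(0, 1), rng.gauss(0, 1)
        eta = -1.0 + 0.2 * r1 - 0.1 * r2 + 0.8 * r1 * r2
        X.append([1.0, r1, r2, r1 * r2])  # intercept, RNA1, RNA2, RNA1*RNA2
        y.append(1 if rng.random() < 1.0 / (1.0 + math.exp(-eta)) else 0)
    return X, y

X, y = simulate(2000, random.Random(0))
beta, pvals = fit_logistic(X, y)
interacting = pvals[3] < 0.05  # criterion for a non-linearly interacting pair
```

Dropping the last column of X gives the additive model, so the two models in Table 3 can be compared with the same routine.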

Framingham Heart Study. As before, we performed conditional logistic regression 5000 times on different sets of matched cases and controls, and recorded the number of times each term in the model was significant (p < 0.05). We analyzed both the additive model accounting for MI with expression of gene A and gene B (MI ~ gene A + gene B) and the model with an additional interaction term (MI ~ gene A + gene B + gene A*gene B). We identified 6 of 903 possible pairs with strong evidence of a non-linear interaction between the two candidate genes (sample median of p-values for the interaction term less than 0.05): CNNM2|GUCY1A3, CNNM2|PEMT, CNNM2|ZEB2, IL6R|LPAL2, MIA3|SLC22A5, and MIA3|ZC3HC1. For each of these six pairs, adding the interaction term to the model greatly decreased the median p-values for all terms in the model, suggesting a functionally significant dynamic interaction in the context of MI (Table 4). Two pairs, CNNM2|GUCY1A3 and CNNM2|ZEB2, were also observed in CATHGEN.

Table 4. Interactions between candidate gene pairs in the Framingham Heart Study. Percent of significant resamples (p < 0.05) for each term in 5,000 repetitions of conditional logistic regression. Additive model: MI as explained by main effects of candidate genes (logit(MI) ~ RNA1 + RNA2). Interactive model: MI as explained by main effects of candidate genes and an interaction term (logit(MI) ~ RNA1 + RNA2 + RNA1*RNA2).

                        Percent Significant Resamples
                 Additive Model      Interaction Model
Gene pair        Gene1    Gene2      Gene1    Gene2    Gene1*Gene2
CNNM2|GUCY1A3    1%       2%         89%      89%      86%
CNNM2|PEMT       1%       1%         54%      54%      54%
CNNM2|ZEB2       0%       75%        56%      51%      56%
IL6R|LPAL2       9%       1%         56%      59%      60%
MIA3|SLC22A5     1%       5%         84%      75%      84%
MIA3|ZC3HC1      1%       2%         58%      45%      57%

Of the genes forming these six pairs, only ZEB2 and IL6R were also significant by themselves (Table 4, Figure 9). The others were not differentially expressed in the Framingham Heart Study, but 6 of the remaining 7 (CNNM2, GUCY1A3, PEMT, LPAL2, MIA3, and ZC3HC1) were differentially expressed in CATHGEN as individual RNAs. Detectable overlap between CATHGEN and FHS supports the finding of non-linear interactions relevant for MI.

MI-relevance of non-linear interacting mRNA pairs

We further investigated how these non-linear RNA interactions affect the odds of MI by generating sample distributions of effect sizes based on the resampling procedure used in the FHS. For the two pairs that replicated between CATHGEN and FHS (CNNM2|GUCY1A3 and CNNM2|ZEB2) we used the distributions of effect sizes to calculate the odds ratio of MI given that the expression of gene A increases by one standard deviation as a function of expression of gene B. In both cases we observe that the expression of one gene (e.g. CNNM2) appears protective when expression of the other (e.g. GUCY1A3) is low and deleterious when it is high (Figure 11). The relevance of these non-linear interactions in accounting for MI status is reflected by both: (1) the reproducibility of the odds ratio curve between different replicates (i.e. all grey lines in the sample distribution of odds ratios fall within the same region, Figure 11), and (2) the saddle surface of the expression versus log-odds 3D plot (Figure 11). In Figure 11, a histogram of gene expression across the population is presented to further assess the utility of this RNA as a biomarker of MI. We observe strongly protective or deleterious mRNA-mRNA ratios in a limited portion of the overall population; nevertheless, the use of multiple gene-gene pairs has potential utility in biomarker panels. The question remains of how these pairs relate to one another.


Figure 11. Odds ratios and conditional odds ratios of MI for non-linearly interacting pairs. A. Conditional odds ratios of MI from one standard deviation increase in the mean expression of gene A, plotted against expression of the second interacting gene, gene B. Histograms of gene B expression are displayed above each panel. The odds ratio is defined as odds(Ex_A + sd(Ex_A)) / odds(Ex_A), where Ex_A and sd(Ex_A) denote the expression level and the standard deviation of expression of gene A, respectively. Red lines indicate an odds ratio of 1. B. The interaction is displayed as a surface plot, a three-dimensional plot of expression of the two genes versus the log-odds of MI. The curvature of the saddle-shaped surface indicates the magnitude of the interaction term in the model.
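The conditional odds ratio in the caption follows directly from the interactive model. The short derivation below is offered as a reading aid. The symbols x_A and x_B (expression of genes A and B), s_A (standard deviation of gene A expression), and the β coefficients are notation introduced here, not taken from the Methods. Terms that do not involve x_A, including any covariates or matching strata, cancel in the ratio.

```latex
\begin{align}
\operatorname{logit} P(\mathrm{MI}) &= \beta_0 + \beta_A x_A + \beta_B x_B + \beta_{AB}\, x_A x_B \\
\mathrm{OR}_A(x_B) &= \frac{\operatorname{odds}(x_A + s_A,\; x_B)}{\operatorname{odds}(x_A,\; x_B)}
  = \exp\!\bigl[\, s_A \,(\beta_A + \beta_{AB}\, x_B) \,\bigr]
\end{align}
```

The odds ratio equals 1 at x_B = -β_A / β_AB and lies on opposite sides of 1 above and below that value. This is why gene A can appear protective at low expression of gene B and deleterious at high expression, or the reverse, depending on the signs of the coefficients.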

Connectivity between non-linear interacting pairs

We investigate the relationship between pairs of non-linearly interacting RNAs by presenting them as a graph of interactions. Pairs with significant interaction terms form a connected graph, which is denser for CATHGEN than Framingham, likely because of the larger number of cases in CATHGEN (Figure 10). The high connectivity of the graph of non-linearly interacting RNAs supports the hypothesis that complex phenotypes arise as a destabilization of feedback within a dense molecular network. The six pairs in FHS do not form a densely connected graph, but do indicate interactions between genes only indirectly connected in the CATHGEN network (Figure 10).

It appears that the selected 41 candidate genes are informative but support only sparsely populated physiological networks. Therefore, we proceeded to co-expression analysis to both assess the biological plausibility that these genes are co-regulated and incorporate additional RNAs that could serve as linkers (relays) between the 41 candidate genes.

Co-expression of candidate genes and inclusion of additional genes complementing sparse networks

Co-expression patterns among 41 candidate genes

We used co-expression in healthy individuals to assess the biological plausibility of non-linear interactions associated with disease. We evaluated co-expression by building small co-expression networks for the 41 candidate genes using GTEx RNA-sequencing data from ‘healthy’ individuals in whole blood (n=190) and arteries (n=137) (Figure 12).


Figure 12. Tissue-specific co-expression networks. Nodes are genes and edges reflect co-expression defined by our algorithm (see methods). Co-expression of candidate genes in artery (A), blood (B), and both artery and blood (C) show similar structure between the two tissues with little overlap in particular gene pairs. Edges within a single gene indicate co-expression between isoforms of the same gene.

Network Construction Procedure. Our algorithm is designed to evaluate the stability of co-expression between pairs of RNAs, rather than the strength of the co-expression per se. Similar to the ARACNE procedure, our method uses an information-theoretic divergence measure (Renyi divergence) to detect 'direct' non-linear dependencies between expression profiles (Margolin et al., 2006; Rempala & Seweryn, 2013). To quantify the stability of 'dynamically co-expressed' pairs, we generate networks on multiple (50,000) random subsamples of individuals and create a consensus network where weights on the edges between mRNAs are defined by the proportion of subsamples where the two mRNA transcripts are observed as co-expressed (see methods section, Appendix G and Figure 13). Half of all possible pairs are observed in at least one of the 50,000 resamples. For further analysis, we focus on those observed in at least half of the 50,000 resamples, representing only 1% of all possible pairs (for rationale see Methods section).
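The consensus step can be sketched in a few lines. The sketch below is an illustration under stated assumptions, not the algorithm used in this study. In particular, it replaces the Renyi-divergence dependency test with a simple threshold on the absolute Pearson correlation, which detects only linear dependence, and it uses simulated expression values. The function names, the threshold of 0.6, and the subsample size are illustrative choices.

```python
import itertools
import math
import random

def pearson(x, y):
    """Pearson correlation of two equal-length lists."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)

def consensus_network(expr, n_subsamples, subsample_size, threshold, rng):
    """Edge weight = proportion of subsamples in which a pair is 'co-expressed'.

    expr maps gene name -> list of expression values, one per individual.
    The correlation threshold is a stand-in for the dependency test.
    """
    genes = sorted(expr)
    n = len(expr[genes[0]])
    counts = {pair: 0 for pair in itertools.combinations(genes, 2)}
    for _ in range(n_subsamples):
        idx = rng.sample(range(n), subsample_size)  # individuals, no replacement
        for g1, g2 in counts:
            x = [expr[g1][i] for i in idx]
            y = [expr[g2][i] for i in idx]
            if abs(pearson(x, y)) >= threshold:
                counts[(g1, g2)] += 1
    return {pair: c / n_subsamples for pair, c in counts.items()}

# Simulated example: B tracks A, C is independent of both.
rng = random.Random(2)
n = 150
a = [rng.gauss(0, 1) for _ in range(n)]
expr = {
    "A": a,
    "B": [v + 0.5 * rng.gauss(0, 1) for v in a],
    "C": [rng.gauss(0, 1) for _ in range(n)],
}
weights = consensus_network(expr, n_subsamples=200, subsample_size=75,
                            threshold=0.6, rng=rng)
stable = [pair for pair, w in weights.items() if w >= 0.5]  # kept for analysis
```

The final line mirrors the cutoff used above: only pairs observed in at least half of the subsamples are retained.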

Figure 13. Introduction of the network construction algorithm.

Disease Interactions and Co-expressions. Twelve (11%) of the 108 non-linearly interacting pairs in CATHGEN were co-expressed in artery and 21 (19%) in blood, indicating the potential for coordinated regulation of these mRNAs (Figure 12). Those gene pairs that were co-expressed in blood were more likely to demonstrate non-linear interactions associated with MI as manifested by a greater specificity in MI prediction among co-expressed pairs (Figure 14). One of the non-linear interacting pairs detected in both CATHGEN and FHS (CNNM2|GUCY1A3) was found to be dynamically co-expressed in both artery and blood. CNNM2|GUCY1A3 was co-expressed in 51% of the 50,000 resamples in artery and 53% in blood. These results suggest co-expression patterns determined in healthy individuals are also relevant for disease.


Figure 14. Co-expression as a filter for disease-relevant interactions. Proportion of true null hypotheses (estimated via the qvalue package in R) when testing differential expression of: single genes, gene pairs, co-expressed pairs. A lower proportion of the true null indicates greater specificity of the model.
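For orientation, the proportion of true null hypotheses (π0) can be estimated from a set of p-values with the basic estimator of Storey: the number of p-values above a cutoff λ divided by the number expected above λ if every hypothesis were null. The sketch below uses a fixed λ of 0.5, which is a simplifying assumption and not necessarily how the qvalue package was configured for this figure. The function name and example vectors are illustrative.

```python
def estimate_pi0(pvals, lam=0.5):
    """Basic Storey estimate of the proportion of true null hypotheses."""
    m = len(pvals)
    above = sum(p > lam for p in pvals)
    return min(1.0, above / (m * (1.0 - lam)))

# All-null example: evenly spread p-values.
all_null = [(i + 0.5) / 100 for i in range(100)]

# Mixed example: half the tests carry signal (tiny p-values), half are null.
mixed = [0.001] * 50 + [(j + 0.5) / 50 for j in range(50)]

pi0_all_null = estimate_pi0(all_null)
pi0_mixed = estimate_pi0(mixed)
```

A lower estimate means that a larger share of the tested genes or pairs carry signal, which is the sense in which the figure compares single genes, gene pairs, and co-expressed pairs.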

Co-expression patterns among candidate genes in networks including additional gene transcripts

Underlying physiologic mechanisms linking gene expression and MI are indirectly reflected in the sparse networks. To overcome the limitations of these sparse networks, we built large-scale co-expression networks around the 41 candidate genes using 13,000 additional transcripts. At this scale, we expect to observe more biologically credible connections than in a small co-expression network limited to candidate genes.

None of the candidate genes were directly co-expressed in these larger networks, indicating that these represent biological processes that are co-regulated but not necessarily in the same pathway. Our analysis revealed indirect connections, i.e., ‘relay’ transcripts that serve to connect candidate genes. Figure 15 displays the shortest paths in blood between a robust pair of candidate mRNAs that interacts in the context of disease: CNNM2|GUCY1A3.
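Finding relay transcripts amounts to listing short paths between two nodes of an unweighted graph. The sketch below shows one simple way to do this by breadth-first expansion of partial paths. It is an illustration, not the procedure used in this study. The toy adjacency map and the relay names R1 to R4 are hypothetical and do not describe the GTEx network. The approach is practical only for small graphs; a network of about 13,000 transcripts would call for a dedicated k-shortest-paths algorithm.

```python
from collections import deque

def k_shortest_paths(adj, source, target, k):
    """Return up to k simple paths from source to target, fewest edges first.

    adj maps each node to the set of its neighbours (undirected graph).
    Partial paths are expanded breadth-first, so paths are found in order
    of non-decreasing length.
    """
    found = []
    queue = deque([[source]])
    while queue and len(found) < k:
        path = queue.popleft()
        node = path[-1]
        if node == target:
            found.append(path)
            continue
        for nxt in sorted(adj.get(node, ())):
            if nxt not in path:  # keep paths simple (no repeated nodes)
                queue.append(path + [nxt])
    return found

# Hypothetical toy network with relay nodes R1..R4.
adj = {
    "CNNM2": {"R1", "R2", "R4"},
    "R1": {"CNNM2", "GUCY1A3"},
    "R2": {"CNNM2", "R3"},
    "R3": {"R2", "GUCY1A3"},
    "R4": {"CNNM2"},
    "GUCY1A3": {"R1", "R3"},
}
paths = k_shortest_paths(adj, "CNNM2", "GUCY1A3", 5)
```

Each returned path lists the intermediate nodes that would then enter the logistic model as additional RNAs and pairwise interaction terms.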

Figure 15. Shortest paths between candidate genes in large-scale co-expression network. Five shortest paths between CNNM2 and GUCY1A3 in co-expression network built using ~13,000 transcripts in artery. Nodes (genes) labeled in pink and edges (pairs) labeled in red are significant in a CATHGEN based logistic model explaining MI by expression of each gene in the path and each pairwise interaction term along the path (i.e. logit(MI) ~ RNA1 + RNA2 + RNA3 + RNA1*RNA2 + RNA2*RNA3 + additional covariates).

We evaluate the disease relevance of these intermediary co-expressions by considering a logistic model in CATHGEN with all individual RNAs and pairwise non-linear interactions defined by this path in the large-scale network tested against MI. In the five shortest paths between CNNM2 and GUCY1A3, we find four individual genes (CNNM2, NUPR1, UBN1, GUCY1A3) and five interactions (CNNM2|ACSL5, CNNM2|IRF2BP1, ZNF319|UBN1, CNBP|NUPR1, NUPR1|GUCY1A3) associated with MI. These links between candidate genes serve as additional candidate genes and interactions.

Testing for epistasis in non-linear interacting candidate gene pairs

Having identified pairwise interactions significant for MI on the RNA level, we asked whether similar interactions occur between SNPs in candidate genes (pairwise epistasis). Using the Framingham Heart Study, we applied the same approach of resampled conditional logistic regression used for RNA interactions. For the 16 candidate genes that were either differentially expressed or formed an interacting pair in the Framingham Heart Study, we considered all possible pairwise interactions between candidate SNPs (GWAS hits published in the GWAS catalog and eQTLs published by GTEx) in the 16 genes. This approach yielded several examples with evidence of epistasis (Table 13), but no evidence for an epistatic interaction between variants in those pairs of genes that exhibited an interaction on the RNA level. Non-linear interactions at the genotype level may be less robust than those observed at the level of RNA expression, which integrates both genetic and non-genetic factors.

MI candidate genes also associated with hypercholesterolemia

We performed an identical analysis of single RNAs and RNA pairs in CATHGEN using a related trait, hypercholesterolemia (defined as previous diagnosis and/or treatment of hypercholesterolemia by a physician). Six individual genes (MIA3, CNNM2, HDAC9, LDLR, PLG, TRIB1) and 146 gene pairs with significant interaction terms in the logistic model were associated with hypercholesterolemia (Figure 16, Figure 17). The number of MI-based candidate genes differentially expressed in hypercholesterolemia is lower than in MI, and most genes differ between the traits. Three individual genes (MIA3, CNNM2, and TRIB1) and 44 gene pairs overlap between MI and hypercholesterolemia (Figure 16, Figure 17). Given that hypercholesterolemia is a risk factor for MI, this expected overlap suggests that these individual genes and gene pairs relate to MI through lipid metabolism.

Figure 16. Differentially expressed candidate genes in hypercholesterolemia. P-values of association between expression of probe ID (labeled by assigned gene) and MI (A) or hypercholesterolemia (B) measured in CATHGEN using logistic regression with age, race, and gender as additional covariates. C. Venn diagram displaying overlap between genes individually significant in MI and hypercholesterolemia.


Figure 17. Differentially expressed candidate gene pairs in hypercholesterolemia. Connected graph formed by pairs of genes with a significant interaction term in CATHGEN using a logistic model accounting for MI (A) or hypercholesterolemia (B) with expression of gene 1, expression of gene 2, expression of gene 1 * expression of gene 2, and additional covariates (age, race, gender). C. Intersection of graphs for MI and hypercholesterolemia. Pink nodes indicate genes that are individually significant in MI (see Figure 16) and orange nodes indicate genes that are individually significant in hypercholesterolemia (see Figure 16).

Discussion

The goal of this study was to find mRNA expression patterns (individual mRNAs, non-linear interactions, and paths in large-scale networks) associated with myocardial infarction. Comparison of mRNA profiles in blood from healthy controls and MI cases served to identify 14 differentially expressed candidate genes and 153 non-linear interactions, while co-expression networks in blood and artery from healthy individuals in GTEx supported 31 of these interactions and brought in additional RNAs connecting potential candidate genes.

mRNA expression guides interpretation of GWAS-based candidate genes

As an intermediate phenotype, gene expression can guide interpretation of GWAS findings (Gamazon et al., 2015; Gusev et al., 2016). Focusing on mRNA expression of GWAS based candidate genes, we identified 14 individual RNAs differentially expressed in MI. FES, FURIN, PHACTR1, IL6R, RAI1, and UBE2Z were significant in both CATHGEN and FHS, while PDGFD, TRIB1, RAI1, LDLR, GGCX, and SLC22A4 had been identified previously in the literature as differentially expressed (Figure 18). The same approach applied to hypercholesterolemia identified three genes also associated with MI (MIA3, CNNM2, and TRIB1), suggesting their role in MI could be mediated through lipid metabolism. In a study of expression profiles in LCLs and B cells treated with statins, TRIB1, along with 7 other genes (LDLR, WDR12, LIPA, EDNRA, TCF21, GUCY1A3, PDGFD) differentially expressed only in MI in our study, also exhibited differential expression upon statin exposure (p < 0.05) and may therefore play a role in response to statin therapy (Bolotin et al., 2014).


Figure 18. Replication of differentially expressed genes. A. Venn diagram displaying number of differentially expressed genes identified by previous studies reporting differentially expressed genes in MI/CAD and number overlapping between studies. B. Venn diagram displaying overlap of differentially expressed genes between CATHGEN and FHS cohorts as determined by our analysis. C. Venn diagram displaying overlap of candidate genes considered in our analysis and those previously identified as differentially expressed in the literature. Gene names colored in red were identified by our analysis as differentially expressed. Note: gene names reported by each study were first converted to ENSG identifiers in order to ensure they were directly comparable. Gene names that did not map to an ENSG were not included.

Pairs of mRNAs contain information relevant to MI and are biologically plausible

Previous analyses of differential expression have focused on individual genes as the basis for assigning pathways or networks. With the hypothesis that coronary artery disease results from disrupted interactions between gene products, we considered pairwise non-linear interactions between candidate mRNAs. We identified multiple pairs with interaction terms associated with MI, two of which replicated across cohorts: CNNM2|GUCY1A3 and CNNM2|ZEB2. Analysis of odds of MI as a function of expression of these RNAs identifies robust odds ratios, supporting the role of a non-linear interaction term in logistic models evaluating disease risk. This emphasizes the potential utility of pairwise interactions between mRNA expression profiles as a biomarker for disease.

Because CNNM2 and GUCY1A3 are co-expressed in small-scale networks in both artery and blood, in addition to showing a non-linear interaction, the question arises about the underlying mechanism that regulates these genes, the disruption of which may lead to MI. A physiologic connection between CNNM2 and GUCY1A3 indeed reveals disease relevant processes. CNNM2, a Mg2+ transporter, interacts in the context of MI with GUCY1A3, a Mg2+-dependent guanylate cyclase. Mutations in CNNM2 have been implicated in familial hypomagnesemia with symptoms including cardiac arrhythmias (Stuiver et al., 2011). Furthermore, transient hypomagnesemia has been reported in acute MI with vasospasm as a proposed mechanism (Sueda et al., 2001). We propose that regulatory variants in CNNM2, and by implication imbalanced mRNA expression, predispose individuals to hypomagnesemia, while those in GUCY1A3 magnify downstream effects favoring spasm and MI.

The relationship between these two genes and Mg2+ is further maintained in one of the shortest paths between the two in a larger co-expression network. The path CNNM2|ACSL5|SCARF1|GUCY1A3 appears to be united by the common themes of magnesium and lipid processing. ACSL5 binds magnesium ions and activates long-chain fatty acids, while SCARF1 acts as a scavenger receptor to regulate uptake of modified LDL – levels of which are decreased in the presence of magnesium (Kapiotis, Hermann, Exner, Laggner, & Gmeiner, 2005). In addition to observing an association between CNNM2 and GUCY1A3 individually with MI, the interaction term between CNNM2 and ACSL5 is also significant (p < 0.05), indicating that relationships determined by co-expression patterns in healthy individuals can be informative of disease.

Co-expression in healthy individuals of the same pair of genes that exhibit a non-linear interaction associated with disease supports the biological plausibility of the pair. In this study, 20% of RNA pairs with non-linear interactions associated with disease are also co-expressed in healthy subjects, indicating co-expression may serve as an additional criterion to prioritize significant interactions.

Evidence for disrupted network structure in MI

Non-linear pairwise interactions associated with MI form a well-connected graph that includes virtually all candidate genes, suggesting that relevant information about disease is dispersed across a broad network. We propose that there is a fundamental difference between detecting genes involved in MI pathophysiology through an additive effect and through a non-linear effect. Defining pairs based on an additive (rather than interactive) model reduces the number of genes included in the network by half, whereas non-linear interactions appear to capture disease risk better and may prove useful in accounting for genetic factors related to complex diseases.

Conclusions

Considering pairwise interactions between candidate genes reveals strong, disease-relevant pairs – a possible entry point for broad study of MI systems biology. Our results further demonstrate that mRNA expression profiles encoded by a limited number of candidate genes yield sparse interacting networks. Using these candidate genes as an anchor to extend the analysis genome-wide, we then searched for relay genes in larger networks, confirming additional candidate genes and identifying novel ones. Additional work is needed to elucidate higher order interactions and further assess potential utility of the dynamic interactions observed here in clinical biomarker panels.

Methods

Data

CATHeterization GENetics (CATHGEN)

Expression data, genotypes, and clinical phenotypes were acquired from CATHeterization GENetics (CATHGEN) via dbGaP Project #5358 on 25 March 2015 (dbGaP accession number phs000703). Expression levels had been determined using the Illumina HumanHT-12 v3 array in RNA from whole blood. We considered age (phv00197199), gender (phv00197207), and race (phv00197206), all recorded in pht003672, as additional covariates in our models. Myocardial infarction (MI), defined by a previously recorded history of MI (phv00197212) or a non-zero CAD index (phv00197202), both recorded in pht003672, was used as a phenotype/outcome variable. Hypercholesterolemia (phv00197204), also recorded in pht003672, was used as a second phenotype/outcome variable. Our access to this study was approved by the Ohio State University IRB (Protocol #2013H0096).

Framingham Heart Study (FHS)

Expression data, genotypes, and clinical phenotypes were acquired from the Framingham Heart Study (FHS) via dbGaP Project #5358 on 21 July 2014 (dbGaP accession number phs000007). Expression levels had been determined using the Affymetrix Human 1.0 ST Array in RNA from whole blood. Genotypes were taken from the SHARe substudy that used the OMNI5M genotyping array. We considered age (phv00177930), gender (phv00177929), and family assignment (phv00024067), recorded in tables pht003099 and pht000183, as covariates in our models. History of MI (variable phv00036469 recorded in table pht000309) was used as the phenotype/outcome variable and was defined by one of the following recorded events:

1 = MI recognized, with diagnostic ECG
2 = MI recognized, without diagnostic ECG, with enzymes and history
3 = MI recognized, without diagnostic ECG, with autopsy evidence, new event (see also code 9)
4 = MI unrecognized, silent
5 = MI unrecognized, not silent
8 = Questionable MI at exam 1

Our access to this study was approved by the Ohio State University IRB (Protocol #2013H0096).

Genotype and Tissue Expression Project (GTEx)

Tissue specific RNAsequencing data were acquired from the Genotype and Tissue Expression Project (GTEx) via dbGaP Project #5358 in April 2014 (dbGaP accession number phs000424). RNAsequencing data had been generated using poly-adenylated priming, with reads aligned to hg19 (GENCODE v12). For further details see Lonsdale et al. (2013) and the GTEx website (http://www.gtexportal.org/home/documentationPage). We considered tibial artery tissue (137 samples) and whole blood (190 samples). We used the number of reads aligned to a given transcript as the measure of expression. In each tissue, we selected only those transcripts that were non-zero in more than 65% of samples. Accordingly, in the small networks, we analyzed 132 transcripts in artery and whole blood, while in the large networks, we analyzed 8,080 transcripts in artery and 6,705 in whole blood.
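The transcript filter described above (non-zero read counts in more than 65% of samples) can be illustrated with a minimal sketch. Python is used here purely for illustration, since the analyses themselves were run in R; the function name, the dictionary layout, and the toy counts are hypothetical and are not GTEx data.

```python
def expressed_transcripts(counts, min_fraction=0.65):
    """Return transcript ids whose read count is non-zero in more than
    `min_fraction` of samples.

    counts: dict mapping transcript id -> list of read counts (one per sample).
    This layout is an assumption made for illustration only.
    """
    kept = []
    for transcript_id, sample_counts in counts.items():
        n_nonzero = sum(1 for c in sample_counts if c > 0)
        if n_nonzero / len(sample_counts) > min_fraction:
            kept.append(transcript_id)
    return kept


# Toy counts (hypothetical): four samples per transcript.
toy_counts = {
    "tx_a": [5, 0, 3, 9],  # non-zero in 75% of samples
    "tx_b": [0, 0, 1, 2],  # non-zero in 50% of samples
    "tx_c": [1, 2, 3, 4],  # non-zero in 100% of samples
}
print(expressed_transcripts(toy_counts))
```

The strict inequality mirrors the wording "more than 65%"; a transcript that is non-zero in exactly 65% of samples would be dropped under this reading.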

Gene list

Candidate genes for differential expression analysis and small co-expression networks were selected from the CAD Genome Wide Association Study performed by the CARDIOGRAMplusC4D consortium and published in 2013 (Deloukas et al., 2013).

Forty-one of the candidate genes assigned to these loci had corresponding probes on expression arrays used in both CATHGEN and the Framingham Heart Study. All probes corresponding to these genes were tested for an association between mRNA expression and MI.

For the large co-expression networks, we considered transcripts included on the Illumina HumanHT-12_V4 expression array and added any candidate genes identified through literature review and use of large databases (e.g. dbGaP, OMIM) that were not already included. We analyzed a total of 12,913 transcripts in 4,098 genes. We obtained transcripts for each gene with the aid of the 'biomaRt' package in the statistical language R (cran.r-project.org).

Association of gene expression and myocardial infarction

CATHGEN: logistic regression

Association between expression and MI (or hypercholesterolemia) was assessed using logistic regression. We considered age, race, and gender as additional covariates in the model (n = 1250; 359 cases and 891 controls). An association was considered significant at p < 0.05. Logistic regression was performed considering expression of a single gene, pairs of genes (with and without an interaction term), and paths derived from tissue-specific co-expression networks. The explicit models are outlined below.

Computations were done using the function 'glm' in the 'stats' package in R.

Framingham Heart Study: bootstrapped conditional logistic regression

Association between expression and MI was assessed using conditional logistic regression (CLR) in R as implemented in the 'survival' package. One hundred and ninety-three cases of MI were matched to 4,084 controls on age (+/- 2 years), gender, and different family assignment (note: all individuals in FHS are Caucasian). CLR was then performed 5,000 times, each time using all 193 cases but with different subsets (resamples) of the 4,084 matched controls. For each CLR associated with an individual case-control pairing, expression of a given gene was considered significant if the null hypothesis (the effect size of gene expression on MI status is zero) was rejected at p < 0.05 by the Wald test. For each effect (main effect as well as interaction effect) in the logistic model, we reported the percent of CLR resamples in which expression was significantly associated with MI. CLR was performed considering expression of a single gene and pairs of genes (with and without an interaction term). The effects of individual variants and pairs of variants acting in epistasis were estimated in the same way as for the expression data. The explicit models are outlined below.
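The resampling scheme above can be sketched as follows. This is a simplified stand-in, not the implementation used in this work (which relied on the R 'survival' package): it assumes one control per case in each resample, a single expression covariate, and a hand-written Newton-Raphson fit of the 1:1 conditional likelihood. All function names and the simulated data are hypothetical.

```python
import math
import random


def _score_and_information(beta, diffs):
    """Score and Fisher information of the 1:1 matched conditional likelihood,
    where diffs[i] = x_case - x_control for matched pair i."""
    score, info = 0.0, 0.0
    for d in diffs:
        z = max(-30.0, min(30.0, beta * d))  # clamp to avoid overflow in exp
        p = 1.0 / (1.0 + math.exp(-z))
        score += d * (1.0 - p)
        info += d * d * p * (1.0 - p)
    return score, info


def fit_clr_1to1(diffs, n_iter=50):
    """Return (beta, two-sided Wald p-value) for one covariate."""
    beta = 0.0
    for _ in range(n_iter):
        score, info = _score_and_information(beta, diffs)
        if info <= 0.0:
            break
        step = score / info
        beta += step
        if abs(step) < 1e-10:
            break
    _, info = _score_and_information(beta, diffs)
    if info <= 0.0:
        return beta, 1.0
    z = beta * math.sqrt(info)  # beta / standard error
    return beta, math.erfc(abs(z) / math.sqrt(2.0))


def fraction_significant(case_values, control_pools, n_resamples=200,
                         alpha=0.05, seed=0):
    """Proportion of resamples with a significant Wald test.

    control_pools[i] holds the expression values of the controls eligible
    to be matched to case i (an assumed data layout)."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_resamples):
        diffs = [x - rng.choice(pool)
                 for x, pool in zip(case_values, control_pools)]
        _, p_value = fit_clr_1to1(diffs)
        hits += p_value < alpha
    return hits / n_resamples


# Simulated example: cases have higher expression than their control pools.
_rng = random.Random(1)
sim_cases = [_rng.gauss(1.5, 1.0) for _ in range(60)]
sim_pools = [[_rng.gauss(0.0, 1.0) for _ in range(20)] for _ in range(60)]
print(fraction_significant(sim_cases, sim_pools, n_resamples=50))
```

The returned proportion corresponds to the "percent of CLR resamples" reported for each effect; models with two genes and an interaction term would require a multivariate fit, which this sketch does not attempt.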

Expression of individual genes

CATHGEN: MI ~ gene expression + age + race + gender

FHS: MI ~ gene expression

Expression of pairs of genes

CATHGEN: MI ~ gene A expression + gene B expression + age + race + gender

MI ~ gene A expression + gene B expression + gene A expression * gene B expression + age + race + gender

FHS: MI ~ gene A expression + gene B expression

MI ~ gene A expression + gene B expression + gene A expression * gene B expression

Expression of paths (CATHGEN only)

CATHGEN: MI ~ gene A expression + gene 1 expression + gene 2 expression … gene n expression + gene B expression + gene A expression*gene 1 expression + gene 1 expression*gene 2 expression … + gene n expression*gene B expression + age + race + gender

Note: In FHS covariates (age, gender, and family) were considered by using matched case-control design.
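The path model generalizes the pairwise model by adding one main effect per gene on the path and one interaction term per pair of adjacent genes. A minimal sketch of how such a model formula could be assembled is given below; the helper name and gene labels are hypothetical, and the R-style ':' is used here to denote a product (interaction) term only.

```python
def path_model_formula(path, covariates=("age", "race", "gender"), outcome="MI"):
    """Build a formula string for a path of genes, e.g. [A, 1, 2, ..., n, B].

    Main effects are included for every gene on the path; interaction terms
    are included only for consecutive genes along the path.
    """
    main_effects = list(path)
    interactions = [f"{a}:{b}" for a, b in zip(path, path[1:])]
    terms = main_effects + interactions + list(covariates)
    return f"{outcome} ~ " + " + ".join(terms)


print(path_model_formula(["geneA", "gene1", "gene2", "geneB"]))
```

With a path of length two, the helper reduces to the pairwise interaction model listed above.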

Estimating odds of myocardial infarction for interacting gene pairs

We used the confidence bounds generated by the bootstrapped CLR in FHS to calculate the odds ratio of MI associated with a one standard deviation increase in the expression of gene A from its mean, as a function of the expression of gene B:

odds(Ex_A+sd(Ex_A))/odds(Ex_A)

where Ex_A and sd(Ex_A) denote the expression level and the standard deviation of expression of gene A, respectively.
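Under the pairwise interaction model listed earlier, the log-odds are linear in the coefficients, so this ratio simplifies to exp((b_A + b_AB * Ex_B) * sd(Ex_A)), where b_A and b_AB are the main-effect and interaction coefficients. The sketch below evaluates that expression; the coefficient names and numeric values are hypothetical placeholders, not estimates from this study.

```python
import math


def odds_ratio_per_sd(beta_a, beta_ab, sd_a, expr_b):
    """Odds ratio of MI for a one-SD increase in gene A expression,
    evaluated at a given expression level of gene B.

    Assumes log-odds = b0 + beta_a*A + beta_b*B + beta_ab*A*B, so the
    intercept and the gene B main effect cancel in the ratio.
    """
    return math.exp((beta_a + beta_ab * expr_b) * sd_a)


# Hypothetical coefficients: the effect of gene A changes sign as B increases.
for expr_b in (0.0, 2.0, 4.0):
    print(expr_b, odds_ratio_per_sd(beta_a=0.2, beta_ab=-0.1, sd_a=1.0, expr_b=expr_b))
```

Evaluating the same function at the lower and upper bootstrap bounds of the coefficients would give a band around the odds ratio curve.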

Co-expression network procedure

Network construction procedure

We evaluated the strength and robustness of co-expression patterns in a Gibbs-Boltzmann model using Rényi divergence, with a free parameter α, as described in detail in Supplementary Methods (Kindermann & Snell, 1980). A similar approach, called ARACNE (Margolin et al., 2006), has been successfully used in a wide variety of applications (Grimaldi, Visintainer, & Jurman, 2011). Briefly, our algorithm may be described as follows. Suppose that the set of m transcripts and the tissue are fixed. We consider k = 50,000 resamples of I individuals for the small networks and k = 300 resamples of I individuals for the large networks. For each random subset of individuals, a weighted, undirected graph of m nodes (m being the number of transcripts) is generated. Edge weights are defined by calculating the pairwise Rényi mutual information, and edges are pruned using the Data Processing Inequality (as described in the Supplementary Methods). Next, with each random subset of individuals we associate an m x m adjacency matrix and, based on the k resamples, generate the m x m average adjacency matrix ('consensus matrix') [M(i,j)]. Each (i,j)-entry is the proportion of resamples in which co-expression of the pair of transcripts (i,j) was observed.
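The pruning and averaging steps can be sketched as below. This is an illustrative Python outline only: it follows the ARACNE-style rule of removing the weakest edge in every fully connected triplet, takes the mutual-information matrices as given, and uses hypothetical function names and toy weights. The actual procedure, including the Rényi estimator, is described in the Supplementary Methods.

```python
from itertools import combinations


def prune_with_dpi(weights):
    """Apply the Data Processing Inequality to a symmetric weight matrix.

    For each triplet in which all three pairwise weights are positive, the
    edge with the smallest weight is treated as indirect and removed.
    Returns a 0/1 adjacency matrix (list of lists).
    """
    m = len(weights)
    adjacency = [[1 if i != j and weights[i][j] > 0 else 0 for j in range(m)]
                 for i in range(m)]
    for i, j, l in combinations(range(m), 3):
        edges = [((i, j), weights[i][j]),
                 ((i, l), weights[i][l]),
                 ((j, l), weights[j][l])]
        if all(w > 0 for _, w in edges):
            (a, b), _ = min(edges, key=lambda edge: edge[1])
            adjacency[a][b] = adjacency[b][a] = 0
    return adjacency


def consensus_matrix(adjacencies):
    """Entry (i, j) is the proportion of resamples in which edge (i, j) survived."""
    k = len(adjacencies)
    m = len(adjacencies[0])
    return [[sum(a[i][j] for a in adjacencies) / k for j in range(m)]
            for i in range(m)]


# Two toy resamples of a three-transcript network (hypothetical weights).
resample_1 = [[0.0, 0.9, 0.2], [0.9, 0.0, 0.8], [0.2, 0.8, 0.0]]
resample_2 = [[0.0, 0.1, 0.7], [0.1, 0.0, 0.6], [0.7, 0.6, 0.0]]
consensus = consensus_matrix([prune_with_dpi(resample_1), prune_with_dpi(resample_2)])
print(consensus)
```

In the toy example the edge between transcripts 1 and 2 survives in both resamples, whereas the other two edges each survive in only one.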

The analysis was implemented in the statistical software R (cran.r-project.org), in particular the package 'infotheo' for discretizing the data (we use the method 'equalfrequencies') and the package 'minet' with function 'aracne' for pruning indirect interactions.
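For readers unfamiliar with these steps, the sketch below illustrates equal-frequency discretization followed by a plug-in mutual information estimate. It is a simplified stand-in written in Python rather than a reproduction of the R packages named above, and it uses the Shannon mutual information (the α → 1 limit of the Rényi family) instead of the Rényi estimator used in this work. Function names are hypothetical.

```python
import math
from collections import Counter


def equal_frequency_bins(values, n_bins):
    """Assign each value to one of n_bins bins holding (nearly) equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, idx in enumerate(order):
        bins[idx] = rank * n_bins // len(values)
    return bins


def mutual_information(x_bins, y_bins):
    """Plug-in Shannon mutual information (in nats) of two discretized vectors."""
    n = len(x_bins)
    joint = Counter(zip(x_bins, y_bins))
    px = Counter(x_bins)
    py = Counter(y_bins)
    return sum((c / n) * math.log((c * n) / (px[x] * py[y]))
               for (x, y), c in joint.items())


expression = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
bins = equal_frequency_bins(expression, n_bins=2)
print(bins, mutual_information(bins, bins))
```

A transcript compared with itself attains the maximum value for two balanced bins (log 2), while independent binned vectors give a value near zero.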

Preserved co-expressions

We focused on the co-expressions that have a sample frequency above 0.5 because of the following observation. Consider a direct interaction between two RNAs X_i and X_j and assume that there exists a triangle (X_i, X_j, X_l) with an indirect interaction (say between X_i and X_l). If the true values of the Rényi mutual information for (X_i, X_j) and for (X_i, X_l) are arbitrarily close (but D_α(X_i, X_j) > D_α(X_i, X_l)), then the sample estimate of D_α(X_i, X_j) should be greater than the sample estimate of D_α(X_i, X_l) at least half of the time. In other words, since a direct interaction between X_i and X_j is stronger than their indirect interaction (acting via an intermediate RNA), the co-expression of (X_i, X_j) is expected to appear in at least half of the resamples in the consensus matrix.
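Selecting preserved co-expressions then amounts to thresholding the consensus matrix at 0.5. A minimal sketch follows, with a hypothetical function name and toy consensus values that are not results from this study:

```python
def preserved_edges(consensus, threshold=0.5):
    """Return index pairs (i, j), i < j, whose consensus frequency exceeds
    the threshold."""
    m = len(consensus)
    return [(i, j) for i in range(m) for j in range(i + 1, m)
            if consensus[i][j] > threshold]


toy_consensus = [
    [0.0, 0.5, 0.2],
    [0.5, 0.0, 0.9],
    [0.2, 0.9, 0.0],
]
print(preserved_edges(toy_consensus))
```

The comparison is strict, matching "above 0.5", so an edge observed in exactly half of the resamples is not retained.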


Chapter 4: The CHRNA5/CHRNA3/CHRNB4 nicotinic receptor regulome: genomic architecture, regulatory variants, and clinical associations

Background

Nicotinic acetylcholine receptors are ligand-gated ion channels activated by acetylcholine and nicotine, mediating synaptic transmission in the central nervous system and regulating diverse cellular functions in the periphery. The receptors are composed of five subunits encoded by nicotinic receptor genes, and channel composition can differ between tissues, encompassing numerous combinations of α and β subunits. Main mediators of nicotine's actions in the brain, α4 and β2 subtypes are prevalent in dopaminergic neurons in the ventral tegmental area that projects to nucleus accumbens or prefrontal cortex (the reward circuit), while α3 and β4 subtypes are well expressed in the habenulo-interpeduncular midbrain pathway (the aversion pathway) (Antolin-Fontes, Ables, Görlich, & Ibañez-Tallon, 2015; Perry et al., 2002). These receptors collectively mediate complex behavioral effects of nicotine and other drugs. Both receptors can incorporate CHRNA5 (α5), leading to altered channel properties that are further influenced by both genetic and post-translational factors (R. E. Consortium et al., 2015). Novel drug treatments targeting the CHRNA3-CHRNB4 (α3-β4) nicotinic receptor have been shown to ameliorate addictive behaviors (Cippitelli, Brunori, Gaiolini, Zaveri, & Toll, 2015; Toll et al., 2012; Willyard, 2015). This study focuses on the CHRNA5-CHRNA3-CHRNB4 locus, implicated in various clinical phenotypes including addiction (Doyle et al., 2011; Liu et al., 2010; Minică et al., 2016; Tyndale et al., 2015; Weiss et al., 2008), while its genomic architecture and overall genetic influence remain uncertain.

The genes encoding the α5, α3 and β4 subunits are located in a single gene cluster on 15q25.1. CHRNA5 and CHRNA3 are encoded on opposite strands of the DNA oriented in opposite directions so that the 3′UTRs of some mRNA isoforms overlap, while CHRNB4 is ~3kb upstream of CHRNA3 with no overlap (Figure 19).

These nicotinic receptor genes and several other protein coding (ADAMTS7, MORF4L1, PSMA4, IREB2, and HYKK) and non-coding genes in this region appear to be in part co-expressed with one another (A. Flora et al., 2000), suggesting coordinated expression over a broad region. The locus is characterized by long frequent haplotype blocks suggestive of strong evolutionary selection pressure. Here we test the hypothesis that this genomic region is subject to regulatory processes that extend beyond single genes, forming a regulome characterized by multiple enhancer and promoter elements that interact with each other via long range DNA looping (Heidari et al., 2014; G. Li et al., 2012; Sanyal, Lajoie, Jain, & Dekker, 2012; Wang et al., 2014; Whalen, Truty, & Pollard, 2016; Zhang et al., 2013). Understanding coordinated regulation among genes in this locus is essential for interpreting genetic associations attributed to specific variants in the cluster using genome-wide association studies (GWAS).


Figure 19. CHRNA5/CHRNA3/CHRNB4 nicotinic receptor locus and analysis approach. A 110 kb region (UCSC hg38 chr15:78,827,000-78,937,000) encompassing protein coding and noncoding RNAs: PSMA4, CHRNA5, RP11-650L12.2, CHRNA3 and CHRNB4. > indicates 5’ to 3’. Relevant SNPs are indicated with arrows.

Multiple variants in the CHRNA5/CHRNA3/CHRNB4 cluster have been associated with a variety of clinical phenotypes and diseases, both peripheral (cardiac arrest, lung cancer) and central (addiction, schizophrenia) (Burdett et al., n.d.; Welter et al., 2014). In this study we focus on identifying haplotypes that capture genetic variation affecting regulation in the brain, while also addressing whether regulatory processes differ between brain regions and peripheral tissues, to attain a general understanding of genetic variability in the cluster. Owing to the extended haplotype structures (Figure 35), it has proven difficult to identify the causative variants and their target genes, with some exceptions. A known nonsynonymous CHRNA5 SNP, rs16969968, and an enhancer haplotype (tagged by rs880395) on an opposite allele have demonstrable effects on α5 function and expression and have been linked to nicotine addiction (Doyle et al., 2011; Liu et al., 2010; Smith et al., 2011; Weiss et al., 2008). The effect of enhancer rs880395 was significant when considered in the context of rs16969968 (Smith et al., 2011), demonstrating relevant interactions between variants in the cluster. Both CHRNA3 and CHRNB4 harbor potential regulatory variants that have yet to be fully characterized.

Resolving the causative variants and their interactions is critical to understanding the combined genetic influence on clinical phenotypes. In this study we use genomics databases to search for regulatory variants in long haplotypes spanning across the CHRNA5/CHRNA3/CHRNB4 cluster and ask how these variants and haplotypes interact with each other, defining a nicotinic receptor regulome (Figure 19). We then test the clinical association of key regulatory elements to explore a new paradigm of integrated genetic influence of a gene cluster on clinical phenotypes.

Results

RNA expression across tissues in the extended gene cluster

A 500 kb window around the CHRNA5/CHRNA3/CHRNB4 cluster (GRCh38: 15:78428191-78906250) includes 20 genes (8 protein coding and 12 non-coding). Tissue specific RNA expression profiles reported in GTEx (Genotype-Tissue Expression) are summarized in Table 5. This includes genes adjacent to CHRNA5/CHRNA3/CHRNB4 that overlap with long haplotype structures (Figure 35). We excluded genes that were not detectably expressed in any tissue or were not reported in GTEx (Table 5). MORF4L1 and PSMA4 mRNAs are highly expressed across all tissues; IREB2, ADAMTS7, CHRNA5, RP11-650L12.2, and HYKK are moderately expressed; and CHRNB4, RP11-160C18.2, RPL21P116, RP11-335K5.2, and CHRNA3 are poorly expressed. Many genes are well-expressed in testes. While CHRNA5 and CHRNA3 are well expressed in multiple brain regions, CHRNB4 mRNA is well-expressed only in peripheral tissues (barely detectable in pituitary, moderately expressed in adrenal, and robustly expressed in testes); therefore, our subsequent analyses of CHRNB4 are restricted to peripheral tissues, even though CHRNB4 is relevant in brain physiology.

Table 5. Tissue specific expression profile of genes in nicotinic receptor locus.

Tissues with RNA expression      Gene
not expressed in any tissue      RP11-160C18.4, RP11-650L12.1, RNU6-415P
not in GTEx                      RP11-160C18.5, RP11-16K12.2, RP11-335K5.3, RP11-650L12.4
universally well-expressed       ADAMTS7, HYKK, MORF4L1, PSMA4, IREB2
tissue specific expression       CHRNA3, CHRNA5, CHRNB4, RP11-650L12.2, RPL21P116
testes only                      RP11-160C18.2, RPL18P11, RP11-335K5.2

Coordinated expression between transcripts

Analysis of pairwise co-expression shows that CHRNA5/CHRNA3/CHRNB4 mRNAs and the CHRNA5 antisense RNA RP11-650L12.2 are strongly co-expressed in most tissues, the exception being testes. Adjacent protein coding genes are not robustly co-expressed across diverse tissues, with some exceptions. Therefore we exclude these genes from further analysis, except for PSMA4 located within the CHRNA3/CHRNA5 haplotype (Figure 35). We focus mainly on skeletal muscle as a peripheral tissue, and on nucleus accumbens, a brain region involved in addiction, to test correlated expression between nicotinic receptor and other proximal genes. In both tissues CHRNA3, CHRNA5, and RP11-650L12.2 are robustly co-expressed (Figure 20). In testes, in contrast to other tissues, CHRNB4 mRNA is more highly expressed than CHRNA3 and CHRNA5. Here, CHRNB4 and CHRNA3 mRNAs are co-expressed (r = 0.39, p = 1.7e-07), and CHRNB4 is inversely correlated with CHRNA5 (r = -0.22, p = 0.0035), perhaps attributable to unique nicotinic functions in germline cells arising through spermatogenesis (Baran et al., 2015).

Figure 20. Tissue-specific mRNA co-expression. Correlation between mRNA levels (RPKM in GTEx) for CHRNA3 with CHRNA5 (left) and RP11-650L12.2 (right) in nucleus accumbens (top) and skeletal muscle (bottom). The estimated correlation and p values are: (A) 0.52, 4.0e-09; (B) 0.65, 9.8e-15; (C) 0.49, < 2.2e-16; (D) 0.50, < 2.2e-16. Each dot represents an individual sample; the line and shading represent the linear model and corresponding confidence bands. One sample was removed as an outlier from the skeletal muscle plots.

Expression quantitative trait loci (eQTLs) for nicotinic receptor and adjacent genes in the CHRNA5/CHRNA3/CHRNB4 cluster

To identify genetic variants modulating expression patterns, we considered SNPs identified as eQTLs reported in GTEx. Power to detect eQTLs may be affected by both sample size and overall expression level, as a function of the sensitivity of RNAseq.

While some eQTLs associate with expression of only a single gene in a single tissue, other variants associate with multiple genes in multiple tissues. To determine which genes appear to be under shared genetic control of expression, we searched for common eQTLs (Figure 21). In nucleus accumbens, only three genes in the locus have identified eQTLs: RP11-650L12.2, CHRNA3, and CHRNA5. Whereas most SNPs are associated with expression of all three genes in skeletal muscle, a number of eQTLs are distinct for CHRNA3 in nucleus accumbens. Common eQTLs can result either from linkage disequilibrium (LD) between the eQTL SNPs and a single causative haplotype, or represent more than one overlapping haplotype. In skeletal muscle, eQTLs were also identified for MORF4L1, RP11-160C18.2, and PSMA4 (not shown), with eQTLs common between PSMA4 and CHRNA5. Sharing of eQTLs between more than one gene suggests regulatory elements that physically interact with one another via DNA looping. With hundreds of eQTLs in the gene locus, we first identify the most significant SNP for each gene in each tissue, followed by LD analysis to determine whether a single or multiple haplotypes account for the observed eQTLs.
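The search for common eQTLs is, in essence, a set-overlap computation of the kind summarized in a Venn diagram. A minimal sketch is shown below, assuming the significant eQTL SNPs for each gene in one tissue have already been collected into sets; the function name, gene labels, and SNP identifiers are placeholders rather than GTEx results.

```python
from collections import defaultdict


def group_snps_by_target_genes(eqtls_by_gene):
    """Group SNPs by the exact set of genes whose expression they associate with.

    eqtls_by_gene: dict mapping gene name -> set of eQTL SNP ids (one tissue).
    Returns a dict mapping frozenset of genes -> set of SNP ids.
    """
    genes_by_snp = defaultdict(set)
    for gene, snps in eqtls_by_gene.items():
        for snp in snps:
            genes_by_snp[snp].add(gene)
    groups = defaultdict(set)
    for snp, genes in genes_by_snp.items():
        groups[frozenset(genes)].add(snp)
    return dict(groups)


toy_eqtls = {
    "geneA": {"snp1", "snp2", "snp3"},
    "geneB": {"snp2", "snp3"},
    "geneC": {"snp3", "snp4"},
}
for genes, snps in group_snps_by_target_genes(toy_eqtls).items():
    print(sorted(genes), sorted(snps))
```

Each group corresponds to one region of the Venn diagram, so SNPs shared by all genes can be separated from those specific to a single gene.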


Figure 21. Overlap between eQTLs for CHRNA3, CHRNA5 and RP11-650L12.2 in nucleus accumbens (A) and skeletal muscle (B) as reported by GTEx.

CHRNA5-enhancer haplotype as ‘master regulator’ of the nicotinic cluster

Figure 22 displays all significant eQTLs for CHRNA5, CHRNA3, and antisense RP11-650L12.2, in nucleus accumbens and skeletal muscle. The SNP with the most significant p value varied across tissues; however, all of the top scoring SNPs for CHRNA5 mRNA were in high LD (D′>0.97, R2>0.91), indicating the presence of a single CHRNA5-enhancer haplotype (Table 14). In GTEx, the CHRNA5-enhancer haplotype is the strongest eQTL for CHRNA5 in skeletal muscle (p = 2.8e-91, effect size = -1.1), but multiple other SNPs also score significantly. The tight LD structure makes it difficult to pinpoint the specific causative variant(s) affecting CHRNA5 expression; therefore, we select for further analysis the haplotype-tagging SNP rs880395, previously shown to be associated with a near fourfold increase in CHRNA5 mRNA expression in brain (Smith et al., 2011). The same CHRNA5-enhancer haplotype was also the strongest signal for RP11-650L12.2 expression across several tissues and for CHRNA3 in skeletal muscle but not in nucleus accumbens (Table 14). Similarly distributed eQTL profiles for CHRNA5 and RP11-650L12.2 in nucleus accumbens, and for CHRNA5, CHRNA3, and antisense RP11-650L12.2 in skeletal muscle (Figure 22), indicate similar genetic regulation, further supported by tight co-expression of CHRNA5 mRNA with RP11-650L12.2 across all tissues and with CHRNA3 in skeletal muscle (Figure 20).

Overall, the CHRNA5-enhancer haplotype is associated with expression of multiple genes, having a similar direction of effect with the nicotinic genes, but an opposite effect on PSMA4 in heart, artery, skin and skeletal muscle (Table 14). However, the CHRNA3 eQTL profile differs from that of CHRNA5 and RP11-650L12.2 in the nucleus accumbens (Figure 22), suggesting distinct regulation in this brain region.
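For reference, the two LD measures quoted throughout this chapter (D′ and R2) can be computed from haplotype and allele frequencies using their standard definitions. The sketch below is illustrative only; the function name is hypothetical and the frequencies are toy values, not estimates for any SNP pair discussed here.

```python
def ld_statistics(p_ab, p_a, p_b):
    """Return (|D'|, r^2) for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B
          at SNP 2; p_a and p_b: frequencies of alleles A and B.
    """
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    return abs(d) / d_max, r2


# Toy frequencies: allele B always occurs on an A haplotype, but A is more
# common than B, so |D'| is 1 while r^2 is well below 1.
print(ld_statistics(p_ab=0.2, p_a=0.3, p_b=0.2))
```

The toy case shows why a pair of SNPs can have D′ = 1 but a modest R2 when their allele frequencies differ, a pattern that recurs among the eQTLs described in this chapter.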

Figure 22. Local Manhattan plot of eQTLs in nicotinic receptor locus. Y-axis shows –log(p-value) in nucleus accumbens (A) and skeletal muscle (B) for SNPs that significantly associate with expression of CHRNA3 (red), CHRNA5 (green), and RP11-650L12.2 (antisense to CHRNA5, blue), as reported by GTEx. eQTLs are plotted by genomic location with relative gene positions indicated on x axis (RP = RP11-650L12.2). eQTLs of interest are highlighted with a diamond: rs880395 (orange), rs1948 (red) and rs16969968 (purple; nonsynonymous SNP also an eQTL). In nucleus accumbens, rs16969968 is not a significant eQTL.

Unraveling potential haplotypes carrying regulatory elements

Many of the identified eQTLs in the CHRNA5/CHRNA3/CHRNB4 cluster are in high LD and represent large haplotype blocks that extend over 200 kb. Assuming one causative haplotype in the region, the p-value for each eQTL should correlate with its LD (R2) to the causative haplotype marked by the highest scoring SNP. Plotting the tissue specific eQTL p-values against the LD R2 values relative to the top scoring eQTL for CHRNA5 and CHRNA3 in skeletal muscle and nucleus accumbens, and for CHRNB4 in esophagus and testis (Figure 23, Figure 36), reveals strong correlations. This indicates that a single haplotype accounts for most of the observed genetic effect on mRNA expression in these tissues. The CHRNA5-enhancer haplotype affects CHRNA3, CHRNA5, and RP11-650L12.2 RNA expression in muscle and other tissues, while in nucleus accumbens, the CHRNA3-enhancer haplotype affects CHRNA3 and RP11-650L12.2 RNA.
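The reasoning above can be made concrete with a short sketch that correlates –log10(p) with LD to the top SNP. We assume a Pearson correlation, squared, for illustration; the function names and the numeric values are hypothetical and are not GTEx or 1000 Genomes data.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def variance_explained_by_ld(p_values, r2_to_top_snp):
    """Squared correlation between -log10(p) of each eQTL and its LD (R2)
    with the top-scoring SNP."""
    neg_log_p = [-math.log10(p) for p in p_values]
    r = pearson_r(r2_to_top_snp, neg_log_p)
    return r * r


# Toy example in which eQTL strength falls off exactly in step with LD.
toy_p = [1e-10, 1e-8, 1e-6, 1e-4]
toy_ld = [1.0, 0.8, 0.6, 0.4]
print(variance_explained_by_ld(toy_p, toy_ld))
```

A value near 1 is consistent with a single causative haplotype, whereas SNPs lying well above the trend would point to an additional, partly independent signal.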


Figure 23. LD accounts for CHRNA3 eQTL strength. Correlation between –log(p-value) and LD (R2) with the top eQTL haplotype in nucleus accumbens (rs1948, correlation r2 = 0.68, A) and skeletal muscle (rs880395, correlation r2 = 0.92, B) for CHRNA3. Colored diamonds highlight CHRNA5-enhancer rs880395 (gold), nonsynonymous CHRNA5 rs16969968 (purple), CHRNA3-enhancer (brain) rs1948 (red), and a potential additional regulatory haplotype for CHRNA3 tagged by rs8042374 (blue). In nucleus accumbens, rs16969968 and rs8042374 are not significant eQTLs. While the nonsynonymous rs16969968 is listed as an eQTL, this result appears to be attributable to LD with the enhancer haplotypes.

Only three brain regions show significant eQTLs for CHRNA3 mRNA, all within the basal ganglia (caudate, nucleus accumbens and putamen). While the rank order of the top SNP varies somewhat between different regions in the basal ganglia, all are in high LD with the eQTL signal marked by rs1948 (CHRNA3-enhancer haplotype). High correlation of eQTL p values and LD relative to rs1948 indicates that only one haplotype is the main genetic factor for CHRNA3 expression in nucleus accumbens (Figure 23).

Other eQTL p values are largely attributable to varying degrees of LD with rs1948.

Because rs1948 is located in the 3′UTR of CHRNB4, its associations may have been falsely attributed to CHRNB4. This SNP is associated with higher CHRNA3 expression in the nucleus accumbens (p = 2.5e-17, effect size = 0.99), a finding we test further in the context of nicotine dependence.

As an association between rs1948 and CHRNA3 expression in the brain is novel, we attempted to replicate this pattern in a separate database, Braineac (The Brain eQTL Almanac), reporting eQTLs for 10 brain regions. Of basal ganglia tissues, only putamen is represented in Braineac. As in GTEx, rs1948 is strongly associated with CHRNA3 expression (Figure 37), but only in putamen and in none of the other 9 brain regions measured.

This association is also supported by considering allelic mRNA ratios measured in GTEx in nucleus accumbens (Figure 38).

For CHRNA5 and RP11-650L12.2 in nucleus accumbens and CHRNA3 in skeletal muscle, one additional eQTL cluster marked by rs8042374 is more significant than predicted by its LD with the CHRNA5-enhancer (blue diamond in Figure 23 and Figure 36). A SNP identified as both an eQTL and GWAS hit, rs8042374 is in high LD with a well-characterized SNP, rs6495309 (R2=0.85, D′=1), reported to reduce transcription factor binding affinity in vitro and also implicated by GWAS as a risk factor for COPD (Lee et al., 2012; C. Wu et al., 2009). Owing to the relatively less significant p value for this eQTL cluster, we did not investigate it further.

Distinct eQTLs are associated with CHRNB4 mRNA expression in two tissues with high expression, marked by different SNPs in partial LD. In testes the top-scoring SNP is rs67292883 and in esophagus rs4886580 (R2=0.23, D′=1.00) (Figure 36). As CHRNB4 resides in a haplotype block separate from CHRNA5 and CHRNA3, neither of the top CHRNB4 eQTLs is in LD with the CHRNA5-enhancer haplotype or the CHRNA3-enhancer haplotype. Whether these variants are active in the brain remains to be determined. Allelic ratios clearly indicate the presence of a cis-acting regulatory variant affecting CHRNB4 expression (Figure 38).

Interaction between CHRNA5 enhancer haplotype and nonsynonymous CHRNA5 rs16969968

The CHRNA5-enhancer haplotype tagged by rs880395 and the non-synonymous risk variant in CHRNA5 rs16969968 reside almost entirely on opposite haplotypes. An acknowledged risk factor in nicotine addiction, the minor rs16969968 allele alters the physiological properties of the α5/α4 channel, by replacing the wild-type α5 subunit.

However, in individuals heterozygous for both rs16969968 and rs880395, the minor enhancer allele of rs880395 increases fourfold the expression of the main allele of rs16969968 (Smith et al., 2011), thereby diluting the relative expression of the minor rs16969968 risk allele. As shown in Figure 24 and Figure 38, changes in CHRNA5 mRNA expression are driven largely by rs880395 rather than by rs16969968 (which also is an eQTL in GTEx, likely as a result of LD with enhancer variants). Therefore, we expect that the presence of the CHRNA5-enhancer haplotype diminishes the impact of rs16969968, as previously reported (Smith et al., 2011), while increased expression of CHRNA5 itself could alter nicotine dependence.
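The dilution argument can be illustrated with simple arithmetic. The sketch below assumes that, without the enhancer, both rs16969968 alleles contribute equally to CHRNA5 mRNA in a heterozygote, and that the enhancer acts only in cis on the main (non-risk) allele; both are simplifying assumptions made for illustration, and the function name is hypothetical.

```python
def risk_allele_fraction(fold_increase_on_main_allele, baseline_fraction=0.5):
    """Fraction of CHRNA5 mRNA carrying the minor (risk) allele in a double
    heterozygote, given the fold increase in expression of the main allele."""
    risk = baseline_fraction
    main = (1.0 - baseline_fraction) * fold_increase_on_main_allele
    return risk / (risk + main)


print(risk_allele_fraction(1.0))  # no enhancer effect
print(risk_allele_fraction(4.0))  # fourfold increase of the main allele
```

Under these assumptions a fourfold increase of the main allele lowers the share of risk-allele transcripts from one half to one fifth, even though the absolute amount of risk-allele mRNA is unchanged.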


Figure 24. Interaction between nonsynonymous CHRNA5 SNP rs16969968 and CHRNA5-enhancer variant rs880395 in determining gene expression. The Y-axis shows rank normalized gene expression of CHRNA5 mRNA in skeletal muscle. Groups on the x-axis are divided by rs16969968 genotype. Total N = 361, for each boxplot N = 36-89. Differences between groups were measured using an ANOVA and Tukey test; **** denotes p = 3.8e-46 and comparing black boxes (GG genotype for rs880395 across different genotypes for rs16969968) denotes p = 0.0011.

Overlapping eQTLs, significant GWAS SNPs, and linkage disequilibrium

To extend the relationships between eQTLs and GWAS hits, as a function of LD, we have created a script to extract data from dbGaP, GTEx, and the 1000 Genomes Project, overlay them, and visualize relationships between LD and functional annotations.

In the resulting heatmap representing LD between candidate variants, SNPs are not ordered by genomic location but clustered by haplotypes defined by R2 (Figure 25).

Candidate SNPs were selected for (1) robust association with gene expression (top eQTLs), (2) association with expression of all three nicotinic receptor genes and the tightly co-expressed non-coding RNA RP11-650L12.2, (3) identification as both eQTLs and GWAS hits or strong LD with overlapping SNPs and regulatory potential (see above), (4) association with allelic ratios reported by GTEx, or SNPs showing imbalanced allelic ratios in DNAse hypersensitivity sites (Maurano et al., 2015), and (5) frequent reports of phenotype associations in the literature (Table 15).
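Grouping candidate SNPs by LD rather than by genomic position can be sketched as follows. The heatmap in this work relies on hierarchical clustering of R2 values; the simplified stand-in below instead collects SNPs into groups connected by pairwise R2 at or above a threshold, which is equivalent to cutting a single-linkage tree at that threshold. The function name, the threshold, and the SNP labels are hypothetical.

```python
def ld_groups(snps, r2, threshold=0.8):
    """Group SNPs so that each member is linked to at least one other member
    of its group by R2 >= threshold.

    r2: dict mapping frozenset({snp_a, snp_b}) -> pairwise R2; missing pairs
        are treated as R2 = 0.
    """
    groups = []
    unassigned = list(snps)
    while unassigned:
        seed = unassigned.pop(0)
        group = [seed]
        frontier = [seed]
        while frontier:
            current = frontier.pop()
            linked = [s for s in unassigned
                      if r2.get(frozenset((current, s)), 0.0) >= threshold]
            for s in linked:
                unassigned.remove(s)
                group.append(s)
                frontier.append(s)
        groups.append(group)
    return groups


toy_snps = ["s1", "s2", "s3", "s4"]
toy_r2 = {
    frozenset(("s1", "s2")): 0.95,
    frozenset(("s2", "s3")): 0.85,
    frozenset(("s3", "s4")): 0.10,
}
print(ld_groups(toy_snps, toy_r2))
```

In the toy example s1 and s3 fall into the same group through their shared link with s2, although no direct R2 between them is supplied.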


Figure 25. LD between candidate SNPs with expression, regulatory, and clinical associations annotated. Forty-seven SNPs prioritized based on GWAS hits, eQTLs, allelic imbalance and literature review are grouped by their LD (R2) in CEU population. Dendrogram on the top depicts the relatedness between groups of SNPs. Colored bars at the top indicate associations with gene expression, clinical traits, and proximity to regulatory marks.

This approach identifies four main haplotype blocks in Caucasians (Figure 25), each marked by a SNP discussed earlier. The nonsynonymous CHRNA5 (rs16969968) and CHRNA5-enhancer (rs880395) variants reside in different linkage blocks (reflective of their negative LD), while the CHRNA3-enhancer (rs1948) shows moderate LD with the CHRNA5-enhancer (indicated in orange). The additional signal (rs8042374) associated with expression of CHRNA5, CHRNA3, and RP11-650L12.2 independent from the CHRNA5- and CHRNA3-enhancer haplotypes identified above remains a separate block.

The CHRNB4 eQTLs also reside in distinct blocks, indicating that the detectable signal for CHRNB4 is separate from that for CHRNA3 and CHRNA5. These results indicate that several haplotype blocks may be relevant to clinical associations, in addition to the CHRNA3- and CHRNA5-enhancer haplotypes.

Preserved linkage blocks across populations narrow the search for functional variants

Tight LD between variants in haplotypes marked by the candidate SNPs in Figure 25 makes it difficult to pinpoint causative variant(s). To address this further, we consider LD blocks of main candidate SNPs (Table 16) across 15 different populations in the 1000 Genomes project. All LD blocks are smaller in an African population (YRI) (Figure 39). The large LD block of the CHRNA5-enhancer (marked by rs880395) in CEU with 8 SNPs in perfect LD (R2 = 1; D′ = 1) is reduced to 3 SNPs (rs880395, rs905740, and rs7164030) in strong linkage disequilibrium (R2 > 0.9) in all 15 populations considered. Using rs1979907 as an anchor for the CHRNA5-enhancer haplotype, we find a block of 6 SNPs (rs8053, rs1979907, rs1979906, rs1979905, rs1504548, rs1979908) in strong linkage disequilibrium (R2 > 0.9) again in all populations. We suggest these 9 SNPs are the strongest candidates for potentially functional variant(s) within the CHRNA5-enhancer haplotype, but the eQTL and GWAS data alone cannot identify the causative variant(s).
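The cross-population comparison amounts to retaining only those SNPs that stay in strong LD with the anchor SNP in every population (or in a minimum number of populations). A minimal sketch is shown below; the function name, the data layout, and the population and SNP labels are placeholders and do not correspond to 1000 Genomes values.

```python
def preserved_ld_block(r2_by_population, threshold=0.9, min_populations=None):
    """Return SNPs whose R2 with the anchor SNP exceeds the threshold in at
    least `min_populations` populations (default: all populations).

    r2_by_population: dict mapping population -> {snp id: R2 with the anchor}.
    """
    populations = list(r2_by_population)
    needed = len(populations) if min_populations is None else min_populations
    counts = {}
    for population in populations:
        for snp, r2 in r2_by_population[population].items():
            if r2 > threshold:
                counts[snp] = counts.get(snp, 0) + 1
    return sorted(snp for snp, count in counts.items() if count >= needed)


toy_r2 = {
    "pop1": {"snp_a": 1.00, "snp_b": 0.95, "snp_c": 0.97},
    "pop2": {"snp_a": 0.92, "snp_b": 0.50, "snp_c": 0.93},
    "pop3": {"snp_a": 0.99, "snp_b": 0.91, "snp_c": 0.40},
}
print(preserved_ld_block(toy_r2))
print(preserved_ld_block(toy_r2, min_populations=2))
```

The `min_populations` argument covers cases where a block is required to be preserved in most but not all populations.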

The linkage block for the CHRNA3-enhancer (marked by rs1948) is substantially smaller with no SNPs in perfect LD and only 5 SNPs in strong LD (R2 > 0.9): rs8023462, rs4887070, rs2904130, rs1948, rs62010332. The variant rs4887070 is the best preserved with rs1948 across populations, being in strong LD (R2 > 0.9) in 10 populations.

The LD block for the nonsynonymous CHRNA5 variant (rs16969968) is again relatively large with 13 SNPs in perfect LD in CEU, and five SNPs (rs72740964, rs16969968, rs8192482, rs4887067, rs146009840) preserved across 14 of the 15 populations (excluding YRI where the minor allele frequency is low (Table 16)). The protein-altering function of the nonsynonymous SNP suggests that it is itself the main causative variant.

Association of nonsynonymous CHRNA5, CHRNA5-enhancer, and CHRNA3-enhancer variants with nicotine dependence – an analysis of single variant effects, interactions, haplotype, and diplotype effects

Our results reveal the presence of several regulatory variants within the CHRNA5/CHRNA3/CHRNB4 cluster, with two well defined haplotypes marked by rs880395 (CHRNA5-enhancer) and rs1948 (CHRNA3-enhancer), the latter associated with CHRNA3 mRNA expression in the basal ganglia. rs880395 and rs1948 are in partial LD with each other, and both are in negative LD with the nonsynonymous CHRNA5 SNP rs16969968. Here we address the combined effects of these three variants on nicotine dependence in a 3,601-person Caucasian cohort from SAGE (Study of Addiction: Genetics and Environment) and the Genetic Architecture of Smoking and Smoking Cessation. Nicotine dependence was defined by smoking history and the Fagerström Test for Nicotine Dependence (FTND) score (range 0-10, ND ≥4). The CHRNA5-enhancer rs880395 was not directly genotyped, but is well marked by rs8053 (R2 = 0.97; D′ = 1); to simplify, we refer to this SNP as rs880395.

Testing each SNP for its association with nicotine dependence using ANOVA confirmed that rs16969968 is a risk factor as reported (Bierut et al., 2008; Smith et al., 2011). Although rs880395 and rs1948 are not individually significant, both are significant in the context of a second variant – rs1948 is significant when considered with rs16969968, and rs880395 when considered with rs1948 (Table 6). The estimated effect of rs16969968 is greatest in a recessive model (dominant model: estimate 0.26, p=0.001; recessive model: estimate 0.5, p=0.0002), possibly because presence of the minor risk allele precludes the presence of the CHRNA5-enhancer allele, which could dilute the rs16969968 effect. To estimate the effect size and significance of each SNP, we consider the most significant models from the ANOVA. Consistent with the literature, rs16969968 is associated with risk of nicotine dependence, rs880395 is protective (negative effect size), and rs1948 is associated with risk (Table 7). Owing to the haplotype structure of the CHRNA5/CHRNA3/CHRNB4 cluster, presence of one variant affects the frequency and effect of the others.
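The nested-model comparisons summarized above can be sketched as a likelihood ratio test, the test named in the Methods for comparing logistic models. The sketch assumes the two models differ by a single SNP term (one degree of freedom) and that their maximized log-likelihoods are already available from a model-fitting routine. The function name and the log-likelihood values are hypothetical and are not taken from the study.

```python
import math


def lrt_one_df(loglik_reduced, loglik_full):
    """Likelihood ratio test for nested models differing by one parameter.

    Returns the test statistic and its p value under a chi-square
    distribution with 1 degree of freedom.
    """
    stat = 2.0 * (loglik_full - loglik_reduced)
    if stat < 0:
        raise ValueError("the full model cannot fit worse than the nested reduced model")
    # For 1 df, P(chi2 > x) = P(|Z| > sqrt(x)) = erfc(sqrt(x / 2)).
    p_value = math.erfc(math.sqrt(stat / 2.0))
    return stat, p_value


# Hypothetical log-likelihoods, e.g. for
#   reduced: ND ~ sex + age + SNP_1
#   full:    ND ~ sex + age + SNP_1 + SNP_2
stat, p_value = lrt_one_df(loglik_reduced=-2000.0, loglik_full=-1997.5)
print(f"LRT statistic = {stat:.2f}, p = {p_value:.4f}")
```

The closed form via the complementary error function holds only for one degree of freedom; comparisons that add several terms at once would need the general chi-square survival function.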

Table 6. Analysis of variance (ANOVA) comparing ability of models with different SNP combinations to account for nicotine dependence. Note: SNPs were considered in recessive model.

Variable of interest | ANOVA p value | Model 1 | Model 2
rs16969968 | 6.0e-06 | ND ~ sex + age | ND ~ sex + age + rs16969968
rs880395 | 0.16 | ND ~ sex + age | ND ~ sex + age + rs880395
rs1948 | 0.92 | ND ~ sex + age | ND ~ rs1948 + sex + age
rs880395 in context of rs16969968 | 0.11 | ND ~ rs16969968 + sex + age | ND ~ rs16969968 + rs880395 + sex + age
rs880395 in context of rs1948 | 0.04 | ND ~ rs1948 + sex + age | ND ~ rs1948 + rs880395 + sex + age
rs1948 in context of rs880395 | 0.12 | ND ~ sex + age_int + rs880395 | ND ~ sex + age_int + rs880395 + rs1948
rs1948 in context of rs16969968 | 0.007 | ND ~ rs16969968 + sex + age | ND ~ rs16969968 + rs1948 + sex + age
rs16969968 in the context of rs880395 | 4.3e-06 | ND ~ sex + age_int + rs880395 | ND ~ sex + age_int + rs16969968 + rs880395
rs16969968 in the context of rs1948 | 1.4e-07 | ND ~ sex + age_int + rs1948 | ND ~ sex + age_int + rs16969968 + rs1948

Table 7. Linear models show protective effect for rs880395 and risk for both rs1948 and rs16969968 for nicotine dependence (ND).

Model | rs16969968 | rs880395 | rs1948
ND ~ rs16969968 + sex + age | 0.27 a (7.5e-06) b | |
ND ~ rs16969968 + rs1948 + sex + age | 0.36 a (1.6e-07) b | | 0.19 a (0.007) b
ND ~ rs880395 + rs1948 + sex + age | | -0.17 a (0.04) b | 0.13 a (0.12) b

To evaluate how these SNPs in combination account for nicotine dependence, we consider haplotypes. Because of the LD relationships between rs16969968 (G/A), rs880395 (G/A) and rs1948 (G/A) (alleles always presented in this order), haplotypes can be assigned in a majority of subjects with high confidence (Figure 26). The GGG ‘wild-type’ haplotype represents only 20% of subjects while AGG is most prevalent in Caucasians (35%). A simple allele test reveals significant differences in risk effect between haplotypes, with AGG, the non-synonymous SNP on the background of the two wild type enhancers, conveying risk (OR 1.4; p=0.03).

Figure 26. Association of nicotine dependence with rs16969968 genotype as well as haplotypes and diplotypes composed of rs16969968, rs880395, and rs1948, reported as odds ratios with 95% confidence intervals. Significance levels denoted by p-value *<0.05, **<0.01. Frequency of each group is indicated on far left (% rounded to nearest integer).

Because of the LD structure, the number of prevalent diplotypes is also reduced, enabling assessment of diplotype risk factors. Thirteen diplotypes with frequencies of 1-23%, most assigned unambiguously, account for >99% of the subjects. The GGG-GGG diplotype is represented in only 5% of subjects with an assigned odds ratio of 1; odds ratios of the other diplotypes range from 0.8 to 2.7 (Figure 26). Consistent with the haplotype analysis, the AGG-AGG diplotype (13%) with the nonsynonymous variant on the background of the major alleles for both enhancers conveys risk with an odds ratio of 1.9 (p = 0.004). The odds ratio for the nonsynonymous variant on the background of the CHRNA5-enhancer, the AGG-GAG diplotype, is not significant (p = 0.76), although carriers of the A allele of rs880395 tend to cluster in the lower risk groups. On the other hand, carriers of the A allele of rs1948 tend to experience higher risk, with the highest odds ratio of 2.7 (p = 0.02) for AGG-GGA. The results support a protective role for rs880395 in this context, and suggest that the rs1948 A alleles causing high expression of CHRNA3 mRNA contribute to risk. The results show that each of the three variants exerts robust effects that depend on the context of the haplotype/diplotype, generating a gradient of risk that is not captured by rs16969968 alone (Figure 26).

Discussion

Our goal was to discover regulatory variants in the CHRNA5/CHRNA3/CHRNB4 cluster, determine the haplotype structures, evaluate phenotypic associations, and test the combined influence of genetic variation on nicotine dependence, under the hypothesis that the cluster has evolved as an interactive regulome. The analyses take advantage of large gene expression data, genome-wide association studies, LD structures, and functional chromatin associations, identifying at least 4 haplotypes carrying regulatory variants, in addition to the non-synonymous SNP rs16969968, a known smoking risk factor. Expression levels and genetic variants (eQTLs) associated with each of the three genes and the CHRNA5 antisense gene RP11-650L12.2 varied substantially between central and peripheral tissue, but co-expression patterns suggest coordinated regulation. Frequent eQTLs for CHRNA5, CHRNA3, and RP11-650L12.2 reside on long overlapping haplotypes, a sign of strong evolutionary selection pressures, while eQTLs for CHRNB4 are separate and were not detectable in brain regions because of low mRNA levels. This study focuses on two main eQTL haplotypes regulating expression of CHRNA5, CHRNA3, and RP11-650L12.2, and on their influence together with rs16969968 on nicotine dependence.

CHRNA5-enhancer and CHRNA3-enhancer haplotypes

We show here that a previously reported CHRNA5-enhancer haplotype tagged by rs880395 not only associates with a robust fourfold increase in CHRNA5 mRNA expression in brain (Smith et al., 2011) and peripheral tissues, but also enhances expression of CHRNA3 and RP11-650L12.2 RNA while reducing PSMA4 mRNA expression in most tissues. Robust co-expression of antisense RNAs can have multiple effects, for example by recruiting the histone modifying machinery reported for the

APOA1 locus (Halley et al., 2014). Our result indicates that the CHRNA5-enhancer serves as a master regulator, likely affecting the expression of multiple genes via DNA looping processes (Heidari et al., 2014; G. Li et al., 2012; Sanyal et al., 2012; Wang et al., 2014; Whalen et al., 2016; Zhang et al., 2013).

However, in the basal ganglia of the brain, a distinct haplotype tagged by rs1948 appears to boost CHRNA3 mRNA expression (the CHRNA3-enhancer haplotype), demonstrating distinct regulation in brain regions germane to nicotine dependence. The CHRNA5-enhancer and CHRNA3-enhancer haplotypes are in partial LD with each other, a critical feature that needs to be reflected in GWAS analyses. In contrast, the nonsynonymous SNP rs16969968 resides nearly exclusively on a haplotype opposite to that of the CHRNA5-enhancer haplotype. As a result, the effect of the minor allele of rs16969968 is diluted in heterozygous carriers if the main allele carries the CHRNA5-enhancer haplotype with high expression.

Genetic effects of rs16969968 and the CHRNA5- and CHRNA3-enhancer haplotypes

Previous studies have demonstrated that the nonsynonymous CHRNA5 SNP rs16969968 (D398N) reduces response to nicotine (Bierut et al., 2008), causing desensitization and reducing Ca2+ current upon activation in (α4β2)2α5 complexes but not (α3β4)2α5 or (α3β2)2α5 (Kuryatov, Berrettini, & Lindstrom, 2011). The SNP rs16969968 does not appear to affect gene expression per se, even though it is assigned an eQTL in GTEx, likely a result of the intricate LD structure of the locus.

Previous studies of a CHRNA5-enhancer region overlapping with the haplotype containing rs880395 report repressor activity in BE(2)-C cells in a luciferase assay (Doyle et al., 2011), but no effect of the minor haplotype alleles. Others have proposed that the CHRNA5-enhancer region contains an RE-1 Silencing Transcription Factor binding site (Ramsay, Rhodes, Thirtamara-Rajamani, & Smith, 2015), while again failing to detect an allele effect. Either a larger region is required to properly assemble the transcription factor complex (and thus generate the SNP-dependent effect observed in vivo), or the studied portion of the CHRNA5-enhancer haplotype is not the active enhancer region.

The CHRNA3-enhancer haplotype active in the basal ganglia, tagged by rs1948, resides in the 3′UTR of CHRNB4. Our eQTL analysis demonstrates that the rs1948-tagged haplotype regulates CHRNA3 mRNA expression in this brain region. Previous studies have reported associations between rs1948 and alcohol and tobacco use (Schlaepfer, Hoft, & Ehringer, 2008; Stephens et al., 2013) and blood pressure (Kaakinen, Ducci, Sillanpää, Läärä, & Järvelin, 2012). While Flora et al. (2013) found no effect of rs1948 on TF binding in vitro using EMSA, rs1948 alters transcriptional activity in a luciferase assay in H446 cells, a lung cancer cell line, but not in neuronal cells (SH-SY5Y or PC-12). Therefore, the causative regulatory variant remains to be identified. We cannot exclude the possibility that additional regulatory variants contribute to expression across the CHRNA5/CHRNA3/CHRNB4 cluster.

Interactions between the CHRNA5-enhancer and CHRNA3-enhancer haplotypes with rs16969968 affecting nicotine dependence

While rs16969968 is a robust risk factor as reported, neither of the CHRNA5- and CHRNA3-enhancer haplotypes, represented by rs880395 and rs1948, respectively, has a significant effect on nicotine dependence when tested in isolation. Using GWAS results on nicotine dependence we show here that these two variants interact with each other, and with rs16969968, an effect missed with separate analysis of individual variants. The CHRNA3-enhancer SNP rs1948 also appears to convey risk, and the CHRNA5-enhancer SNP rs880395 is protective. To assess the combined impact on nicotine dependence, we calculate haplotype effects. Because of the extensive LD structure, only a limited number of haplotypes are extant in the population. The CHRNA5-enhancer haplotype alters nicotinic receptor composition while also reducing the relative contribution of the minor risk allele of rs16969968. The CHRNA3-enhancer haplotype could affect nicotine dependence by altering receptor composition specifically in regions of the basal ganglia. As a haplotype (in order rs16969968 (G>A) – rs880395 (G>A) – rs1948 (G>A)), AGG carries the greatest risk. We can also assign diplotypes to 13 groups accounting for >99% of the GWAS cohort (Caucasians), which convey risk in a gradual fashion from lowest (GGG-GAG) to highest (AGG-AGG and AGG-GGA). The minor allele of rs16969968 is uncommon (MAF 1–5%) in Asian and African populations. Therefore, the enhancer haplotypes and possibly additional variants could be more important in these populations.

The risk assignments with use of diplotypes, while similar in some compositions to rs16969968 genotype alone, differ substantially for other diplotypes, such as the high risk

GGG-GGA, which would be classified by rs16969968 genotype alone (GG) as being low risk (Figure 26). Since addictive behaviors are influenced by multiple gene loci that are typically added into a composite risk score, even small errors in risk assessment at a given gene locus can lead to substantial errors in the overall score.

Conclusions

In conclusion, our results show that a genomic region with functionally related genes, such as the CHRNA5/CHRNA3/CHRNB4 cluster, is under coordinated regulatory control, revealed by co-expression analysis and eQTL patterns. As a result, clinical association studies employing individual genetic variants separate from each other are bound to miss a critical portion of the overall genetic variability conveyed by this type of gene cluster. Our approach suggests re-interpretation of genetic effects of the CHRNA5/CHRNA3/CHRNB4 cluster and serves as a roadmap for studying the ‘regulome’ of such gene clusters.

Methods

Datasets

GWAS and mRNA expression data generated by other institutions were downloaded from dbGaP. This project is compliant with the regulations of the Ohio State University (OSU) Institutional Review Board (IRB) and operates under a protocol reviewed and approved by a duly constituted ethics committee (OSU IRB).

Genotype and Tissue Expression Project (GTEx)

Tissue specific expression, eQTLs, and allelic ratios. We acquired tissue specific RNA sequencing data generated by GTEx via dbGaP Project #5358 through the dbGaP repository, accession number phs000424.v5.p1, in April 2014 and the GTEx Portal in January-May 2016. RNA libraries had been generated using poly-adenylated priming with reads aligned to HG19, Gencode v12. GTEx reports allelic ratios as number of reads for the reference allele over total number of reads for that heterozygous SNP. Tissue specific eQTLs were calculated by GTEx using normalized RPKM (Reads Per Kilobase of transcript per Million mapped reads) and imputed genotypes with PEER factors and gender as covariates. See Lonsdale et al. (2013) and http://www.gtexportal.org/home/documentationPage.

GWAS Catalog & GRASP

General clinical associations were derived from the NHGRI-EBI GWAS Catalog (https://www.ebi.ac.uk/gwas/) and the Genome-Wide Repository of Associations Between SNPs and Phenotypes (GRASP) database (http://grasp.nhlbi.nih.gov/Overview.aspx).

SAGE & The Genetic Architecture of Smoking and Smoking Cessation

Nicotine dependence GWAS data: The Study of Addiction: Genetics and Environment (SAGE) (phs000092.v1.p1) includes COGA (Collaborative Study on Genetics of Alcoholism), FSCD (Family Study on Cocaine Dependence) and COGEND (Collaborative Genetic Study of Nicotine Dependence). The Genetic Architecture of Smoking and Smoking Cessation (phs000404.v1.p1) includes additional COGEND subjects and UW-TTURC (University of Wisconsin Transdisciplinary Tobacco Use Research Center). There is a lack of gender balance in the study population; therefore we cannot draw conclusions regarding the impact of gender, although it was included as a covariate in our models.

Linkage disequilibrium

LD plots for the entire 500 kb region were generated with Haploview v4.2 (HapMap Download Version 3, Release R2). R2 and D′ between candidate SNPs were calculated using the ‘ld’ function in the ‘snpStats’ package implemented in R. Linkage disequilibrium was calculated based on imputed genotypes reported by GTEx in Caucasians only (n = 377) for plots of R2 versus p value. To identify linkage blocks preserved across populations, 1000 Genomes population specific genotyping was used.
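For readers less familiar with the two LD measures used throughout, the following minimal sketch computes R2 and D′ for a pair of biallelic SNPs from phased haplotype frequencies. This is a textbook calculation under the assumption that haplotype frequencies are known; it is not the ‘snpStats’ implementation, which works from genotype data. The function name and example frequencies are hypothetical.

```python
def ld_stats(p_ab, p_a, p_b):
    """Return (D, r2, D') for two biallelic SNPs.

    p_ab: frequency of the haplotype carrying allele A at SNP 1 and allele B at SNP 2
    p_a, p_b: frequencies of allele A and allele B, each strictly between 0 and 1
    """
    if not (0.0 < p_a < 1.0 and 0.0 < p_b < 1.0):
        raise ValueError("both SNPs must be polymorphic")
    d = p_ab - p_a * p_b
    r2 = d * d / (p_a * (1.0 - p_a) * p_b * (1.0 - p_b))
    # D' rescales D by its maximum attainable magnitude given the allele frequencies.
    if d >= 0:
        d_max = min(p_a * (1.0 - p_b), (1.0 - p_a) * p_b)
    else:
        d_max = min(p_a * p_b, (1.0 - p_a) * (1.0 - p_b))
    return d, r2, d / d_max


# Hypothetical case 1: the two minor alleles always travel together (perfect LD).
print(ld_stats(p_ab=0.30, p_a=0.30, p_b=0.30))

# Hypothetical case 2: the two minor alleles never share a haplotype ("negative LD").
print(ld_stats(p_ab=0.0, p_a=0.35, p_b=0.40))
```

The second case illustrates why two variants can show complete linkage (|D′| = 1) while R2 remains modest: the minor alleles sit on opposite haplotypes, so neither is a good proxy for the other even though no recombinant haplotype is observed.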

Overlapping eQTLs and significant GWAS SNPs

To integrate molecular and clinical associations, we determined which SNPs in the 500 kb region surrounding the nicotinic cluster were both eQTLs for the CHRNA5/CHRNA3/CHRNB4 cluster (in any tissue as reported by GTEx) and GWAS hits (with any clinical phenotype in GRASP). Multiple SNPs serve as eQTLs and GWAS hits, suggesting that regulatory variants account at least in part for clinical associations. This query resulted in 11 SNPs (rs1051730, rs11858836, rs12914385, rs13180, rs17486278, rs1994016, rs2036527, rs2219939, rs3825807, rs4380028, rs55958997, rs667282, rs7173743, rs8034191, rs8040868, rs8042238, rs8042374, rs899997) that had been assigned to all four genes of interest (CHRNA3, CHRNA5, CHRNB4, and RP11-160C18.2), as both eQTLs and GWAS hits. Including all SNPs in perfect LD (R2 = 1, D′ = 1) with these 11 variants, we then filtered with biomaRt in R for those falling in regulatory regions defined by ENSEMBL (ENSRs), resulting in 17 SNPs. These show some degree of linkage disequilibrium with the candidate SNPs identified and were included with other candidate SNPs in Figure 6.
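The core of this query is a set intersection: a SNP is retained only if it is an eQTL for every gene of interest and also appears among the GWAS hits. A minimal sketch follows; the function name, gene labels, and SNP labels are placeholders for illustration and do not reproduce the GTEx or GRASP data.

```python
def shared_eqtl_gwas_snps(eqtls_by_gene, gwas_snps):
    """Return SNPs that are eQTLs for every listed gene and also GWAS hits."""
    shared = set(gwas_snps)
    for snps in eqtls_by_gene.values():
        shared &= set(snps)
    return sorted(shared)


# Placeholder inputs for illustration only.
eqtls_by_gene = {
    "GENE_1": ["snp_a", "snp_b", "snp_c"],
    "GENE_2": ["snp_a", "snp_b", "snp_d"],
    "GENE_3": ["snp_a", "snp_b"],
}
gwas_snps = ["snp_b", "snp_c", "snp_e"]

overlap = shared_eqtl_gwas_snps(eqtls_by_gene, gwas_snps)
print(overlap)
```

Requiring a SNP to be an eQTL for all genes is the strictest version of the filter; relaxing it to "any gene" would replace the intersection over genes with a union before intersecting with the GWAS set.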

Statistics

ANOVA with covariates matching those in GTEx, followed by Tukey multiple comparisons of means (95% family-wise confidence level) was used to compare means between genotype groups. Correlations (including tissue-specific co-expression (RPKM) between genes) were tested using the ‘cor.test’ function in R.

Testing clinical significance of identified variants

We use FTND score (Fagerström Test for Nicotine Dependence) to define nicotine dependence as ≥4. The analysis was restricted to Caucasians and only included subjects who smoked at least 100 lifetime cigarettes. The strict definition for nicotine independence (having significant smoking exposure but no evidence of dependence) serves to isolate the phenotype of dependence but also limits the size of our control population (3:1 ratio for cases (N=2756) and controls (N=845)). Haplotype and diplotype probabilities were generated with Golden Helix SNP & Variation Suite (SVS) v8.3.4 using the Expectation Maximization (EM) algorithm. The most likely haplotypes and diplotypes were used as predictors of nicotine dependence with age and gender included as covariates. Odds ratios and confidence intervals were calculated from the estimated effect size and standard error in the linear model. These results were further compared to ones obtained via the ‘BayHap’ (Bayesian analysis of haplotype association using Markov Chain Monte Carlo) package in R. Additive logistic models were compared using ANOVA with a likelihood ratio test (LRT) implemented in R. GLM p-values are reported for the standard Wald test and in ANOVA are reported for LRT (unless otherwise stated).
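The conversion from an estimated effect size and standard error to an odds ratio with a confidence interval can be sketched as follows, assuming the estimate is on the log-odds scale (as in a logistic model) and a normal approximation for the interval. The function name is hypothetical, and the standard error in the example is invented for illustration; only the estimate of 0.5 echoes a value reported in the Results.

```python
import math


def odds_ratio_ci(beta, se, z=1.959964):
    """Convert a log-odds estimate and its standard error to an odds ratio
    with a Wald-type confidence interval (z = 1.959964 gives 95%)."""
    if se <= 0:
        raise ValueError("standard error must be positive")
    return (
        math.exp(beta),
        math.exp(beta - z * se),
        math.exp(beta + z * se),
    )


# beta = 0.5 matches the recessive-model estimate quoted in the Results;
# se = 0.13 is a hypothetical value chosen only to make the example run.
or_, lower, upper = odds_ratio_ci(beta=0.5, se=0.13)
print(f"OR = {or_:.2f} (95% CI {lower:.2f} to {upper:.2f})")
```

Because the interval is computed on the log scale and then exponentiated, it is asymmetric around the odds ratio, which is why confidence intervals in forest-style plots such as Figure 26 extend further above the point estimate than below it.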


Chapter 5: Genetic variation in the cholesteryl ester transfer protein locus

Background

Heart disease is the leading cause of death in the United States with nearly 750,000 instances of myocardial infarction (MI) and 600,000 deaths annually (Go). Risk of MI has been associated with high LDL levels and low HDL levels in epidemiological studies; accordingly, lipid levels have become a primary target for disease prevention. As a first line therapy, statins lower LDL levels and are the most prescribed class of medication in the United States, taken by ~25 million Americans (Mann, Reynolds, Smith, & Muntner, 2008). Despite widespread use, only half of those taking statins are protected from MI even with lower lipid levels, leaving 12 million Americans at risk for a heart attack (Arca & Gaspardone, 2007). Therapies that increase HDL, instead of lowering LDL, have been proposed as an adjunct to statins. Several pharmaceutical companies have targeted Cholesteryl Ester Transfer Protein (CETP), which functions in reverse cholesterol transport, exchanging cholesteryl esters for triglycerides between HDL and LDL particles in the bloodstream, thereby decreasing HDL (Hesler, Swenson, & Tall, 1987). Early clinical trials of CETP inhibitors show conflicting results, with clinical trials for torcetrapib being terminated early due to increased rates of cardiovascular and all-cause mortality, perhaps attributable to off-target effects (Barter et al., 2007; Schwartz et al., 2012), and trials for both evacetrapib and dalcetrapib terminated early due to lack of efficacy.

Genetic variants in CETP have been consistently associated with lipid levels in general and HDL in particular, traits well documented for their relationship with MI. The association of CETP variants with MI is however more tenuous, with some variants demonstrating decreased and others increased risk (Agerholm-Larsen, Nordestgaard,

Steffensen, Jensen, & Tybjaerg-Hansen, 2000; Inazu et al., 1990; Nagano et al., 2004;

Papp et al., 2012). Additionally, some evidence suggests variants in CETP may account for differential response to statins (Boekholdt et al., 2005; Kuivenhoven et al., 1998;

Leusink et al., 2014; Regieli et al., 2008). In contrast to the well-acknowledged role in lipid metabolism, additional evidence links CETP to immunity. Upregulation of CETP in response to LPS, high expression in the spleen, and structural homology with genes known for their role in innate immunity (Lipopolysaccharide Binding Protein,

Bactericidal Permeability Increasing Protein) all indicate CETP (and variants it harbors) may play an important role in immunity and inflammation (Azzam & Fessler, 2012;

Bingle et al., 2004; Ito et al., 2004; Van Eck et al., 2007).

Our group (among others) has identified CETP variants associated with reduced expression and aberrant splicing in liver tissue (Boekholdt et al., 2005; Papp et al., 2012; Suhy et al., 2014). While these variants are associated with changes in blood lipids, associations with clinical traits such as MI remain unclear. Nevertheless, CETP appears to be critical, with too low and too high levels deleterious, and haplotype structures that suggest positive evolutionary pressures for selection of multiple variants modulating the encoded protein's functions. Hence, a lack of significant association with MI or treatment response could have resulted from incomplete knowledge of all relevant CETP variants and gene functions. Here we test the relationship between CETP mRNA levels, variants, hypercholesterolemia, and MI using CathGen. We evaluate tissue specific CETP expression and co-expression patterns using GTEx and place these in the context of genes known to be involved in lipid metabolism and immunity.

Results

Frequent finding in GWAS

Variants in and around CETP have been associated with 15 different traits in more than 40 GWAS, suggesting the locus harbors genetic variants important for a number of clinical phenotypes (Burdett et al., n.d.; Welter et al., 2014). It is one of the densest regions of the genome for GWAS associations (Johnson et al., 2009), emphasizing the need for targeted studies to identify functional variants. CETP is strongly implicated in lipid metabolism as one of just four genes (CETP, TRIB1, FADS1-2-3, APOA1) associated with all four lipid traits (triglycerides, HDL, LDL, total cholesterol) considered by the Lipids GWAS Consortium (Willer et al., 2013). These associations replicate across GWAS of different racial and ethnic groups, highlighting the significance of this locus and the potential for broad applicability (Braun et al., 2012; Radovica et al., 2013; Schierer et al., 2012).

Using a publicly available database that curates GWAS results (GRASP-DB), for a given study and a given clinical trait, we consider the correlation between the p-value for association of a variant with the trait and its linkage disequilibrium with the top scoring SNP in the locus (Figure 40). If the GWAS association is driven by a single functional haplotype, we expect the strength of association for each SNP to correlate with linkage disequilibrium to the top-scoring variant (most significant p-value). For all four lipid traits (HDL, LDL, total cholesterol, and triglycerides), we identify SNPs that are more significant than expected given their linkage disequilibrium, indicating the presence of multiple functional haplotypes.

Well-expressed in immune tissues and co-expressed with immune genes

Both the location and quantity of mRNA expression can guide our understanding of how CETP functions in normal physiology and whether this differs in a disease state. Using GTEx, we quantified the tissue-specific expression profile of CETP. CETP is most robustly expressed in spleen tissue but also well-expressed in adipose (subcutaneous and visceral), adrenal gland, breast (mammary tissue), EBV-transformed lymphocytes (LCLs), kidney (cortex), liver, lung, pituitary, thyroid and whole blood (Figure 27). To interpret this profile, we compared it with expression profiles for other genes in the same gene family (PLTP, BPI, LBP), the genes neighboring CETP (HERPUD1, NLRC5), the target of statin medications (HMGCR), and a gene shown to be differentially expressed in MI (TRIB1) (Figure 28). The expression profile of CETP shows little similarity with that of HMGCR, suggesting that CETP, as a potential drug target for dyslipidemia, operates by a mechanism different from that of statins. Additionally, CETP exhibits an expression profile distinct from its flanking genes, emphasizing the independent regulation of each of these genes in the locus.

Figure 27. CETP best expressed in Spleen. Expression of CETP (RPKM) in different tissues as reported by GTEx. Figure generated using the GTEx Browser.

Figure 28. CETP expression profile clusters with immune-related family members LBP and BPI. Tissue denoted by colored bar at the top of the plot, for legend see Appendix H.

In a genome wide co-expression analysis conducted using RNAsequencing data of splenic tissue (see Appendix G for details of method), CETP was co-expressed with ABCA7, ASAP1, EIF5, FAM3A, IRF5, MR1, NAGK, TMEM50B, and TSC1 (Figure 29). All but one of these genes (NAGK) have evidence for a role in immunity and inflammation, while only one (ABCA7) is implicated in lipid metabolism. This suggests that in addition to its well-recognized role in lipid metabolism, CETP may also function in immune processes.

[Figure 29 image: network of genes co-expressed with CETP (TSC1, EIF5, FAM3A, IRF5, TMEM50B, ASAP1, ABCA7, MR1, NAGK), each annotated with its reported immune function. Legend: primary function in immunity; experimental evidence for role in immunity; circumstantial evidence for a role in immunity; no evidence for immune function.]

Figure 29. CETP is co-expressed with immune-related genes.

Expression of CETP in blood is associated with MI

Given the inconsistent associations between CETP variants and MI in the literature, as well as questions about the role in cardiovascular disease of HDL (with which CETP is robustly associated), we began by asking if expression levels of CETP (a more proximal phenotype than SNPs) are associated with MI. We found an association between CETP expression (measured in blood using microarray technology) and MI (estimate: -0.037, p-value: 0.007) that remains even when hypercholesterolemia is considered as a covariate (estimate: -0.035, p-value: 0.01), suggesting the association between CETP and MI is at least in part independent of the gene’s role in lipid metabolism. This association is in the opposite direction (higher expression is protective) from that expected given the design and implementation of CETP inhibitors for treatment of coronary artery disease. Furthermore, it is observed for only one of the probe IDs (ILMN_2098013) used to measure CETP expression and not for a second (ILMN_1681882; p-value 0.62).

CETP transcripts exhibit marked differences in expression

Given the difference in association between CETP expression and MI for different microarray probes and the possibility for these probes to have different affinity for CETP isoforms, we considered differences in expression between the two predominant isoforms: a full-length form that is secreted from the cell and shows lipid-transfer functionality, and a splice isoform lacking the 9th exon that remains in the endoplasmic reticulum and has no such lipid-transfer capabilities. In fitting a mixture model to normalized expression values in spleen, we find the full-length isoform to fall predominantly in one mixture component, indicative of relatively uniform expression across individuals. In contrast, the splice isoform shows large variability with approximately equal distribution into two different mixture components that have a different mean and variance.
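To make the idea of mixture components concrete, the sketch below fits a two-component Gaussian mixture to a vector of expression values by expectation-maximization. The text does not specify the mixture family or software used, so the Gaussian form, the function names, the initialization, and the toy data are all assumptions made for illustration.

```python
import math


def normal_pdf(x, mean, var):
    """Density of a normal distribution with the given mean and variance."""
    return math.exp(-((x - mean) ** 2) / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_mixture(values, n_iter=200, min_var=1e-6):
    """Fit a two-component Gaussian mixture by EM.

    Returns (weights, means, variances); component 0 starts from the lower
    half of the sorted data and component 1 from the upper half.
    """
    xs = sorted(values)
    n = len(xs)
    half = n // 2
    means = [sum(xs[:half]) / half, sum(xs[half:]) / (n - half)]
    grand_mean = sum(xs) / n
    grand_var = max(sum((x - grand_mean) ** 2 for x in xs) / n, min_var)
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability that each value belongs to each component.
        resp = []
        for x in xs:
            dens = [w * normal_pdf(x, m, v) for w, m, v in zip(weights, means, variances)]
            total = sum(dens)
            resp.append([d / total for d in dens])
        # M-step: update weight, mean, and variance of each component.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * x for r, x in zip(resp, xs)) / nk
            var_k = sum(r[k] * (x - means[k]) ** 2 for r, x in zip(resp, xs)) / nk
            variances[k] = max(var_k, min_var)  # floor avoids a degenerate component
    return weights, means, variances


# Toy normalized expression values with two clear groups (invented data).
toy = [0.90, 1.00, 1.10, 1.00, 0.95, 1.05, 2.90, 3.00, 3.10, 3.00, 2.95, 3.05]
weights, means, variances = fit_two_component_mixture(toy)
print(weights, means, variances)
```

An isoform with uniform expression would place nearly all of its weight in one component, whereas a bimodal isoform splits its weight between two components with distinct means, which is the pattern described for the splice isoform. This sketch assumes at least two input values and data in which neither component collapses to zero weight.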

Candidate SNPs in the CETP gene locus associate with CETP expression but not MI

Our group has previously identified two variants in CETP. One, rs5883, is likely a functional variant altering rates of splicing (Papp et al., 2012; Suhy et al., 2014). A second, rs247616, is a potentially functional variant associated with decreased gene expression (Suhy, Hartmann, Papp, Wang, & Sadee, 2015). It resides upstream of the gene and is strongly (but not perfectly) associated with allelic expression ratios, indicating it either acts in concert with another variant or is not itself functional. Here we consider the potential contribution of an additional SNP, rs6499863 that also resides upstream of the gene and is in strong negative linkage disequilibrium with rs247616

(Figure 30, Figure 41).

Figure 30. Candidate SNPs in CETP gene locus. rs247616 and rs6499863 are located upstream of the gene in the intergenic region between HERPUD1 and CETP. rs5883 is located in exon 9.

We begin by determining the relationship between each of these SNPs and expression of CETP. Using GTEx, we find rs247616 is associated with decreased gene expression in both spleen and liver, while rs6499863 shows borderline significance for increased gene expression in liver (Table 8). Of note, we find rs5883, which has been shown previously to increase rates of splicing, associated with decreased expression of CETP in whole blood (Table 8). This may be attributable to impaired release of the splice variant from the cell, resulting in the appearance of reduced expression. Using CATHGEN we find rs247616 is associated with decreased expression and rs6499863 with increased expression, while rs5883 is non-significant (Table 8). The significant associations for the enhancer variants in whole blood in CATHGEN but not GTEx may in part reflect the different methodologies for determining expression, as CETP is poorly expressed and arguably falls below the level of detection in all tissues but spleen in the GTEx RNAsequencing data, which is not the case with targeted microarray probes.

Given the strong negative LD between rs247616 and rs6499863 and their opposite directions of effect, it remained a question whether each has an effect on expression or simply marks the absence of the other. To address this, we tested whether considering these variants in combination improved our ability to account for expression of CETP. In CATHGEN, we used an ANOVA to compare generalized linear models accounting for CETP expression with rs247616 alone versus with rs247616 and rs6499863 in combination. We find adding rs6499863 significantly improves the ability to explain CETP expression (p-value 0.015), indicating both SNPs contribute to CETP expression. When adding rs5883 to the model, we do not find an improvement (p-value 0.37), consistent with its non-significant association with CETP expression in CATHGEN. We next tested whether these SNPs that are associated with CETP expression are also associated with MI, but find no evidence for this (Table 9).
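The nested-model comparison described above can be illustrated with a small likelihood ratio test. This is a sketch under stated assumptions, not the R code used for the analysis: it assumes a Gaussian linear model, uses simulated genotypes and expression values, and the helper names (fit_rss, lrt_one_df, draw_haplotype) and effect sizes are hypothetical.

```python
import math
import random


def fit_rss(predictors, y):
    """Least-squares fit of y on an intercept plus predictors; returns the residual sum of squares."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(rows[0])
    # Augmented normal equations [X'X | X'y].
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gaussian elimination with partial pivoting, then back substitution.
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(a[r][col]))
        a[col], a[pivot] = a[pivot], a[col]
        for r in range(col + 1, k):
            factor = a[r][col] / a[col][col]
            for c in range(col, k + 1):
                a[r][c] -= factor * a[col][c]
    beta = [0.0] * k
    for i in reversed(range(k)):
        beta[i] = (a[i][k] - sum(a[i][j] * beta[j] for j in range(i + 1, k))) / a[i][i]
    return sum((yi - sum(b * x for b, x in zip(beta, r))) ** 2 for r, yi in zip(rows, y))


def lrt_one_df(rss_reduced, rss_full, n):
    """LRT for one added term: statistic n*ln(RSS0/RSS1), chi-square with 1 df."""
    stat = max(n * math.log(rss_reduced / rss_full), 0.0)
    return stat, math.erfc(math.sqrt(stat / 2.0))


def draw_haplotype(rng):
    """Hypothetical haplotype frequencies: the two minor alleles never co-occur."""
    u = rng.random()
    if u < 0.3:
        return (1, 0)
    if u < 0.5:
        return (0, 1)
    return (0, 0)


# Simulated data: two SNPs in negative LD with opposite effects on expression.
rng = random.Random(0)
snp_a, snp_b, expression = [], [], []
for _ in range(500):
    h1, h2 = draw_haplotype(rng), draw_haplotype(rng)
    a_count, b_count = h1[0] + h2[0], h1[1] + h2[1]
    snp_a.append(a_count)
    snp_b.append(b_count)
    expression.append(-0.5 * a_count + 0.5 * b_count + rng.gauss(0.0, 0.5))

rss_reduced = fit_rss([snp_a], expression)        # first SNP alone
rss_full = fit_rss([snp_a, snp_b], expression)    # both SNPs
stat, p_value = lrt_one_df(rss_reduced, rss_full, len(expression))
print(f"LRT statistic = {stat:.2f}, p = {p_value:.3g}")
```

A small p-value here indicates that the second SNP explains expression beyond what the first already captures, which is the logic used to argue that both variants contribute rather than one merely marking the absence of the other.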

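Table 8. Association between candidate SNPs and CETP expression in GTEx and CATHGEN.

Study                                          SNP        Tissue       Beta    P-value
GTEx (RNAseq)                                  rs247616   liver        -0.21   0.034
GTEx (RNAseq)                                  rs247616   spleen       -0.42   0.0016
GTEx (RNAseq)                                  rs247616   whole blood  -0.02   0.67
GTEx (RNAseq)                                  rs6499863  liver        0.18    0.1
GTEx (RNAseq)                                  rs6499863  spleen       0.028   0.86
GTEx (RNAseq)                                  rs6499863  whole blood  -0.025  0.66
GTEx (RNAseq)                                  rs5883     liver        -0.046  0.80
GTEx (RNAseq)                                  rs5883     spleen       -0.084  0.7
GTEx (RNAseq)                                  rs5883     whole blood  -0.24   0.012
CATHGEN (Illumina human ht_12_v4 microarray)   rs247616   whole blood  -0.12   0.02
CATHGEN (Illumina human ht_12_v4 microarray)   rs6499863  whole blood  0.20    0.003
CATHGEN (Illumina human ht_12_v4 microarray)   rs5883     whole blood  0.14    0.19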

Table 9. Association between candidate SNPs and MI in CATHGEN.

Study    SNP        Phenotype  Beta   P-value
CATHGEN  rs247616   MI         -0.02  0.39
CATHGEN  rs6499863  MI         -0.02  0.45
CATHGEN  rs5883     MI         0.04   0.35

Discussion

High expression of CETP in spleen and co-expression with immune-related genes is consistent with earlier studies in the literature that report high expression of CETP by resident macrophages in the liver and germinal center B cells in lymph nodes (Ito et al., 2004; Van Eck et al., 2007). Despite evidence of robust expression in immune tissues for over a decade, the function of CETP in these cells remains unknown. As lipid exchange is also vital for immune cells, it may retain this function; however, the presence of the delta-9 splice isoform (associated with rs5883), which lacks lipid-transfer capabilities and remains within the endoplasmic reticulum, indicates potential for an additional function. Development of a novel function for the delta-9 splice isoform is further supported by expression levels that can be divided into multiple mixture components, one of which shows substantially higher variability. Multiple isoforms and different affinities of microarray probes for the two predominant isoforms may contribute to variable results in the association of gene expression with MI.

Despite evidence for an inverse association between CETP expression and MI, variants in CETP show no association. This indicates that expression captures additional information not well marked by the SNPs.

Conclusions

The expression and co-expression profile of CETP suggests it may have a currently unrecognized role in immunity or inflammation in addition to its well-known role in lipid metabolism and transport, which may be important to consider at the level of individual isoforms. Given the crosstalk between reverse cholesterol transport and innate immunity and the role of both processes in atherosclerosis and MI, this is critical to consider in the context of CETP inhibitors that have shown poor efficacy.

Methods

Data

CATHeterization GENetics (CATHGEN)

Expression data, genotypes, and clinical phenotypes were acquired from CATHeterization GENetics (CATHGEN) via dbGaP Project #5358 (dbGaP accession number phs0000703 on 25 March 2015). Expression levels had been determined using Illumina HumanHT-12 v3 in RNA from whole blood. We considered age (phv00197199), gender (phv00197207), and race (phv00197206), recorded in pht003672, as additional covariates in our models. Myocardial infarction (MI), defined by a previous recorded history of MI (phv00197212) or a non-zero CAD index (phv00197202) recorded in pht003672, was used as a phenotype/outcome variable. Hypercholesterolemia (phv00197204), recorded in pht003672, was also used as a covariate. Our access to this study was approved by the Ohio State University IRB (Protocol #2013H0096).

Genotype and Tissue Expression Project (GTEx)

Tissue-specific RNAsequencing data were acquired from the Genotype and Tissue Expression Project (GTEx) via dbGaP Project #5358 (dbGaP accession number phs000424 in April 2014). RNAsequencing had been generated using poly-adenylated priming with reads aligned to HG19, Gencode v12. For further details see Lonsdale et al. (2013) and the GTEx website (http://www.gtexportal.org/home/documentationPage). We used the RPKM as the measure of expression. Gene expression measured in RPKM was log transformed and clustered using the heatmap.2 function in the gplots package in R.

Testing clinical significance of identified variants

Additive logistic models were compared using ANOVA with a likelihood ratio test (LRT) implemented in R. Age and gender were used as covariates. The dataset was restricted to only those of Caucasian race. P-values for individual GLM terms are from the standard Wald test, and p-values from ANOVA are from the LRT (unless otherwise stated).

Mixture model

Gene expression was normalized by taking the residuals from a glm fit with gene expression explained by covariates (PEER factors calculated by GTEx, age, gender) selected using the cv.glmnet function in the ‘glmnet’ package implemented in R. A Gaussian mixture model was fit using the ‘mclust’ package.
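To make the mixture-model step concrete, the sketch below fits a two-component, one-dimensional Gaussian mixture by expectation-maximization. It is an illustration only: the analysis itself used mclust in R, which also selects the number of components and the variance structure, whereas this sketch fixes two components with unequal variances. The function names and the simulated values are hypothetical stand-ins for normalized expression residuals.

```python
import math
import random


def normal_pdf(x, mean, var):
    return math.exp(-(x - mean) ** 2 / (2.0 * var)) / math.sqrt(2.0 * math.pi * var)


def fit_two_component_gmm(values, n_iter=200):
    """EM for a 1-D mixture of two Gaussians with unequal variances."""
    ordered = sorted(values)
    n = len(ordered)
    means = [ordered[n // 4], ordered[3 * n // 4]]  # quartiles as starting means
    grand_mean = sum(values) / n
    grand_var = sum((v - grand_mean) ** 2 for v in values) / n
    variances = [grand_var, grand_var]
    weights = [0.5, 0.5]
    for _ in range(n_iter):
        # E-step: posterior probability of each component for each value.
        resp = []
        for v in values:
            dens = [weights[k] * normal_pdf(v, means[k], variances[k]) for k in range(2)]
            total = sum(dens) or 1e-300
            resp.append([d / total for d in dens])
        # M-step: update weights, means, and variances from the responsibilities.
        for k in range(2):
            nk = sum(r[k] for r in resp)
            weights[k] = nk / n
            means[k] = sum(r[k] * v for r, v in zip(resp, values)) / nk
            var_k = sum(r[k] * (v - means[k]) ** 2 for r, v in zip(resp, values)) / nk
            variances[k] = max(var_k, 1e-6)  # floor to avoid a collapsing component
    return weights, means, variances


# Simulated residuals: a tight component and a more variable one.
rng = random.Random(1)
residuals = ([rng.gauss(0.0, 0.5) for _ in range(200)]
             + [rng.gauss(4.0, 1.0) for _ in range(200)])
weights, means, variances = fit_two_component_gmm(residuals)
for w, m, v in zip(weights, means, variances):
    print(f"weight = {w:.2f}, mean = {m:.2f}, variance = {v:.2f}")
```

An isoform whose values fall almost entirely in one component would show one weight close to 1, while roughly equal weights with different means and variances correspond to the pattern reported for the splice isoform.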


Chapter 6: Concluding Remarks

In the preceding chapters, I have presented evidence for non-additive, non-linear interactions in the context of CAD. These include interactions between multiple functional haplotypes in accounting for both gene expression (LIPA, CETP) and clinical traits (nicotine dependence). They also include interactions between gene expression patterns that incorporate both genetic and environmental factors, as in the case of CNNM2|GUCY1A3 (among other gene pairs) associated with MI. This work supports a role for non-additive effects at both the SNP and gene level in an ongoing debate about the relevance of epistasis for missing heritability. It paints a picture of multiple functional haplotypes in a gene locus, sometimes with opposing effects, that contribute to gene expression and clinical traits in a non-linear manner.

Interpreting GWAS signal

I have largely focused on loci implicated in CAD by GWAS, because limitations in statistical power make it infeasible to test all possible interactions. In exploring these loci, gene assignment is often ambiguous. Each locus harbors multiple isoforms for a given gene. These isoforms may be co-expressed, indicative of shared regulation by genetic and environmental factors, or differentially expressed, as observed with the two predominant isoforms of CETP, suggesting transcript-specific regulation. These loci also contain numerous non-coding and antisense RNAs. Although ncRNAs are not always accounted for in the interpretation of GWAS results, they are prevalent in these loci.

Evidence presented here suggests that ncRNAs may in some cases be the relevant gene or modify the main effect. For example, in the loci marked by rs72689147 and rs17087335, these SNPs are associated with expression of only the ncRNA in the locus, not the protein-coding gene to which they have been assigned.

Expression can guide the search for the relevant gene, but may often implicate multiple genes. This is the case with one of the enhancer variants identified in the nicotinic receptor locus: rs880395 serves as an enhancer for CHRNA5, CHRNA3, and RP11-650L12.2, as well as a repressor of PSMA4. These multi-gene eQTLs suggest long-range interactions (e.g., via chromatin looping) or a more general mechanism that directs expression of the entire locus, for example by controlling openness of the chromatin. This may be the case with rs6689306, which is associated with expression of both IL6R and TDRD10 and falls in a DNase hypersensitivity site where it associates with allelic ratios.

Such locus-wide interactions may in some cases be mediated by non-coding RNAs, as was previously observed with APOA1-antisense, which alters chromatin conformation and mediates expression of APOA1, APOC3, and APOA4 (Halley et al., 2014).

In addition to the gene or transcript, the relevant tissue and pathway are often not straightforward to determine for GWAS-implicated loci, frequently because of tissue-specific effects. SNPs may be active only in certain tissues; for example, in the nicotinic receptor cluster rs1948 is the top eQTL for CHRNA3 only in basal ganglia. Genes may also be active only in certain tissues. In addition to a simple presence or absence of an effect, the meaning or function of a particular SNP or gene may also vary by tissue. For example, with CETP we may expect the predominant function in liver and adipose tissue to be lipid transfer, while a second, perhaps more immune-driven function is predominant in spleen. Even when a relevant tissue or biological pathway is well recognized, it may not be the exclusive tissue or pathway relevant for a given gene, as highlighted not only by this work but also by the pleiotropic effects of medications like statins.

Ultimately, I think we need to alter our approach so that GWA and other genetic studies are not designed to implicate a single SNP, assigned to a single gene in risk of a single disease. As biologists, we may restrict our focus to a physical space and classify a gene as being involved in a particular pathway, but the body doesn’t restrict it to that and will quickly co-opt it for another purpose if given the opportunity.

Where to look for epistasis?

Focusing on GWAS findings serves to reduce the search space, but also limits detectable interactions to those between genes or variants that have a statistically significant individual effect. Epistasis is put forth as a theory to account for the missing heritability in part because these interactions are expected to exhibit large effect sizes that make substantial contributions to disease risk. There is however debate about whether interactions built from individually significant components (i.e. GWAS hits) versus those that show no signal when considered on their own will contribute these large effects.

Although the interactions may be governed by identical mechanisms, making this distinction to some extent arbitrary, it may prove useful because it provides a framework for discussion and, more importantly, because it asks a critical question: can we expect interactions between GWAS hits? In my opinion, the answer to this question is both yes and no. Yes, in that, as is the case with diplotypes in the nicotinic receptor locus presented above, we can expect that considering the relationship between variants will refine our understanding of these loci and better account for disease risk. No, in that I do not expect these particular interactions will reveal the type of robust effect sizes and large advances in addressing the missing heritability that are expected.

When we test for epistasis between marker SNPs in gene pairs that exhibit a significant, disease-associated interaction in their expression patterns, we do not find compelling examples. This is almost certainly influenced by the use of marker SNPs.

We selected the best possible markers by using eQTLs and GWAS variants assigned to these genes, but the potential for multiple functional haplotypes not adequately captured by these variants, in combination with current limitations of mathematical modeling, likely limited our ability to resolve interactions. Beyond these limitations, I think compelling results were also lacking because interactions between individually deleterious SNPs are less likely. Risk variants would be selected against and unlikely to be present on a sufficient diversity of genetic backgrounds for such robust interactions to arise.

In light of the discussion about epistatic interactions being built from SNPs that are individually significant, it is important to note that in the example of the nicotinic receptor cluster, two non-significant SNPs (rs880395 and rs1948) modify the effect of a GWAS-based risk variant (rs16969968). This suggests yet another class of interaction effects in which one variant is significant individually, while the other ‘modifier variant’ is not. This is not unlike findings in cystic fibrosis where regulatory variants have been shown to moderate effects of the Mendelian SNPs responsible for disease (Cutting, 2010).

Linkage disequilibrium reflects imprint of evolution

The nicotinic receptor locus highlights the utility of linkage disequilibrium in resolving interactions. Linkage disequilibrium contains essential information about the potential for interactions between genetic variants. In the nicotinic locus, three SNPs of interest define a total of 27 different diplotypes, but only 13 of these account for 99% of the population. Thus linkage disequilibrium can be used to direct the search for disease-relevant non-additive effects in a biologically meaningful way. Due to the curse of dimensionality, such effects could be extremely difficult to capture by analyzing the interaction effects in generalized linear models, the standard approach, indicating a need for new computational and theoretical approaches. Regardless, a similar exploration of co-occurrence among GWAS variants may generate better models than the linear ones currently being used to estimate heritability. In a way, linkage disequilibrium is the imprint left by evolution, and strong patterns like those observed in both the nicotinic receptor locus and the CETP locus suggest the presence of evolutionary pressure.
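The counting argument above can be sketched in a few lines. The sketch reads the figure of 27 as the 3 × 3 × 3 combinations of per-SNP genotypes (0, 1, or 2 copies of the minor allele), which is an interpretation rather than a description of how the diplotypes were called. The function name and the toy sample are hypothetical and are not the study data.

```python
from collections import Counter
from itertools import product

# Each biallelic SNP has three unphased genotypes, coded as minor-allele counts.
GENOTYPES = (0, 1, 2)
all_combinations = list(product(GENOTYPES, repeat=3))  # 3**3 combinations


def combinations_covering(observed, fraction):
    """Smallest number of most common combinations covering at least `fraction` of the sample."""
    counts = Counter(observed)
    total = sum(counts.values())
    covered = 0
    for rank, (_, count) in enumerate(counts.most_common(), start=1):
        covered += count
        if covered >= fraction * total:
            return rank
    return len(counts)


# Toy sample of 100 individuals (invented counts, for illustration only).
toy_sample = ([(0, 0, 0)] * 60 + [(1, 0, 0)] * 25
              + [(0, 1, 1)] * 14 + [(2, 2, 2)] * 1)
print(len(all_combinations), "possible combinations")
print(combinations_covering(toy_sample, 0.99), "combinations cover 99% of the toy sample")
```

When linkage disequilibrium is strong, most of the possible combinations are rare or absent, so the number returned is far smaller than the number of possible combinations; this is the reduction in search space that the paragraph describes.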

It is in light of this evolutionary pressure that genetic variants, however they may arise, become fixed and common within the population. For this reason I maintain, as put forth in Sadee et al., that epistatic interactions are most likely to occur in the context of health rather than disease. Deleterious variants are selected against, so they may not be frequent enough to develop interactions and, due to their low frequency, may be difficult to detect by the current approach in GWAS. Positively selected variants, however, appear frequently and on different backgrounds where beneficial interactions can occur.

Competing evolutionary pressures may allow for positive selection of variants that are under some circumstances deleterious. Thus direction of effect and frequency of a variant may not always be reliable indicators.

Gene expression, which is not by definition beneficial or deleterious, is an ideal phenotype to consider, while large variability, which can signify interactions, is a potential marker of epistasis (Rönnegård et al., 2012). Thus, in my opinion, one of the most promising avenues to uncover relevant interactions is to consider SNPs that account for variability in gene expression, explored as a proof of concept in the second chapter.

What might this complex system look like?

Here, by using gene expression to guide a search for functional variants and consider their interactions, I present evidence for multiple functional haplotypes. These at times exhibit opposing directions of effect, as with dual enhancer variants (rs247616 and rs6499863) associated with expression of CETP, or risk (rs16969968, rs1948) and protective (rs880395) variants associated with nicotine dependence. Multiple haplotypes contributing to regulation of individual transcripts, genes, and clinical traits suggest that multiple components are involved at all stages of gene regulation in order to titrate gene expression or clinical features to optimal levels.

Consideration of patterns of gene expression reveals that disease information is not encoded in a single gene, but rather dispersed across a network of interactions.

Operating under the hypothesis that disease results not from dysregulation of a single gene, but from the gain or loss of synergy between genes, I consider pairwise interaction terms between expression patterns of candidate genes. By comparing additive versus interactive models, the third chapter perhaps provides the greatest insight into the structure of interactions. What we find is not a handful of pairs made from just a few genes; rather, virtually all candidate genes we considered are represented among only a few hundred pairs. Compared to additive models, interaction models reveal fewer pairs of RNAs that are composed of a more diverse set of genes, suggesting the information content is not within a few select genes but rather embedded in the relationships defined using almost all of these genes. Just as the ratio between HDL and LDL is a better predictor of cardiovascular disease than either alone, I am more convinced than ever that the information about health and disease is not encoded in individual values, but dispersed across a network of interactions.


References

Agerholm-Larsen, B., Nordestgaard, B. G., Steffensen, R., Jensen, G., & Tybjaerg-

Hansen, A. (2000). Elevated HDL cholesterol is a risk factor for ischemic heart

disease in white women when caused by a common mutation in the cholesteryl ester

transfer protein gene. Circulation, 101(16), 1907–12. Retrieved from

http://www.ncbi.nlm.nih.gov/pubmed/10779455

Antolin-Fontes, B., Ables, J. L., Görlich, A., & Ibañez-Tallon, I. (2015). The habenulo-

interpeduncular pathway in nicotine aversion and withdrawal. Neuropharmacology,

96(Pt B), 213–22. doi:10.1016/j.neuropharm.2014.11.019

Arca, M., & Gaspardone, A. (2007). Atorvastatin efficacy in the primary and secondary

prevention of cardiovascular events. Drugs, 67 Suppl 1, 29–42. Retrieved from

http://www.ncbi.nlm.nih.gov/pubmed/17910519

Auton, A., Abecasis, G. R., Altshuler, D. M., Durbin, R. M., Bentley, D. R., Chakravarti,

A., … Schloss, J. A. (2015). A global reference for human genetic variation. Nature,

526(7571), 68–74. doi:10.1038/nature15393

Aziz, H., Zaas, A., & Ginsburg, G. S. (2007). Peripheral blood gene expression profiling

for cardiovascular disease assessment. Genomic Medicine, 1(3-4), 105–12.

doi:10.1007/s11568-008-9017-x

Azzam, K. M., & Fessler, M. B. (2012). Crosstalk between reverse cholesterol transport

and innate immunity. Trends in Endocrinology and Metabolism: TEM, 23(4), 169–78. doi:10.1016/j.tem.2012.02.001

Baran, Y., Subramaniam, M., Biton, A., Tukiainen, T., Tsang, E. K., Rivas, M. A., …

Lappalainen, T. (2015). The landscape of genomic imprinting across diverse adult

human tissues. Genome Research, 25(7), 927–36. doi:10.1101/gr.192278.115

Barter, P. J., Caulfield, M., Eriksson, M., Grundy, S. M., Kastelein, J. J. P., Komajda, M.,

… Brewer, B. (2007). Effects of torcetrapib in patients at high risk for coronary

events. The New England Journal of Medicine, 357(21), 2109–22.

doi:10.1056/NEJMoa0706628

Bierut, L. J., Stitzel, J. A., Wang, J. C., Hinrichs, A. L., Grucza, R. A., Xuei, X., …

Goate, A. M. (2008). Variants in nicotinic receptors and risk for nicotine

dependence. The American Journal of Psychiatry, 165(9), 1163–71.

doi:10.1176/appi.ajp.2008.07111711

Bingle, C. D., LeClair, E. E., Havard, S., Bingle, L., Gillingham, P., & Craven, C. J.

(2004). Phylogenetic and evolutionary analysis of the PLUNC gene family. Protein

Science : A Publication of the Protein Society, 13(2), 422–30.

doi:10.1110/ps.03332704

Boekholdt, S. M., Sacks, F. M., Jukema, J. W., Shepherd, J., Freeman, D. J., McMahon,

A. D., … Kastelein, J. J. P. (2005). Cholesteryl ester transfer protein TaqIB variant,

high-density lipoprotein cholesterol levels, cardiovascular risk, and efficacy of

pravastatin treatment: individual patient meta-analysis of 13,677 subjects.

Circulation, 111(3), 278–87. doi:10.1161/01.CIR.0000153341.46271.40

Bolotin, E., Armendariz, A., Kim, K., Heo, S.-J., Boffelli, D., Tantisira, K., … Medina,

M. W. (2014). Statin-induced changes in gene expression in EBV-transformed and native B-cells. Human Molecular Genetics, 23(5), 1202–10.

doi:10.1093/hmg/ddt512

Brænne, I., Civelek, M., Vilne, B., Di Narzo, A., Johnson, A. D., Zhao, Y., … Leducq

Consortium CAD Genomics‡. (2015). Prediction of Causal Candidate Genes in

Coronary Artery Disease Loci. Arteriosclerosis, Thrombosis, and Vascular Biology,

35(10), 2207–17. doi:10.1161/ATVBAHA.115.306108

Braun, T. R., Been, L. F., Singhal, A., Worsham, J., Ralhan, S., Wander, G. S., … Sham,

P. (2012). A Replication Study of GWAS-Derived Lipid Genes in Asian Indians:

The Chromosomal Region 11q23.3 Harbors Loci Contributing to Triglycerides.

PLoS ONE, 7(5), e37056. doi:10.1371/journal.pone.0037056

Burdett T (EBI), Hall PN (NHGRI), Hastings E (EBI), Hindorff LA (NHGRI), Junkins

HA (NHGRI), Klemm AK (NHGRI), MacArthur J (EBI), Manolio TA (NHGRI),

Morales J (EBI), P. H. (EBI) and W. D. (EBI). (n.d.). The NHGRI-EBI Catalog of

published genome-wide association studies. Retrieved from www.ebi.ac.uk/gwas

Bush, W. S., Dudek, S. M., & Ritchie, M. D. (2009). BIOFILTER: A KNOWLEDGE-

INTEGRATION SYSTEM FOR THE MULTI-LOCUS ANALYSIS OF GENOME-

WIDE ASSOCIATION STUDIES.

Cippitelli, A., Brunori, G., Gaiolini, K. A., Zaveri, N. T., & Toll, L. (2015).

Pharmacological stress is required for the anti-alcohol effect of the α3β4* nAChR

partial agonist AT-1001. Neuropharmacology, 93, 229–36.

doi:10.1016/j.neuropharm.2015.02.005

Consortium, the Cardi. (2015). A comprehensive 1000 Genomes-based genome-wide

association meta-analysis of coronary artery disease. Nature Genetics. doi:10.1038/ng.3396

Consortium, R. E., Kundaje, A., Meuleman, W., Ernst, J., Bilenky, M., Yen, A., …

Ziegler, S. (2015). Integrative analysis of 111 reference human epigenomes. Nature,

518(7539), 317–330. doi:10.1038/nature14248

Cutting, G. R. (2010). Modifier genes in Mendelian disorders: the example of cystic

fibrosis. Annals of the New York Academy of Sciences, 1214, 57–69.

doi:10.1111/j.1749-6632.2010.05879.x

Deloukas, P., Kanoni, S., Willenborg, C., Farrall, M., Assimes, T. L., Thompson, J. R.,

… Samani, N. J. (2013). Large-scale association analysis identifies new risk loci for

coronary artery disease. Nature Genetics, 45(1), 25–33. doi:10.1038/ng.2480

Do, C. B., Hinds, D. A., Francke, U., & Eriksson, N. (2012). Comparison of family

history and SNPs for predicting risk of complex disease. PLoS Genetics, 8(10),

e1002973. doi:10.1371/journal.pgen.1002973

Douglas, R., Fienberg, S. E., Lee, M. L. T., Sampson, A. R., & Whitaker, L. R. (1990).

Positive dependence concepts for ordinal contingency tables, 189–202. Retrieved

from http://projecteuclid.org/euclid.lnms/1215457559

Doyle, G. A., Wang, M.-J., Chou, A. D., Oleynick, J. U., Arnold, S. E., Buono, R. J., …

Berrettini, W. H. (2011). In vitro and ex vivo analysis of CHRNA3 and CHRNA5

haplotype expression. PloS One, 6(8), e23373. doi:10.1371/journal.pone.0023373

Elguindy, A., & Yacoub, M. H. (2013). The discovery of PCSK9 inhibitors: A tale of

creativity and multifaceted translational research. Global Cardiology Science &

Practice, 2013(4), 343–7. doi:10.5339/gcsp.2013.39

Flora, A. V, Zambrano, C. A., Gallego, X., Miyamoto, J. H., Johnson, K. A., Cowan, K. A., … Ehringer, M. A. (2013). Functional characterization of SNPs in CHRNA3/B4

intergenic region associated with drug behaviors. Brain Research, 1529, 1–15.

doi:10.1016/j.brainres.2013.07.017

Flora, A., Schulz, R., Benfante, R., Battaglioli, E., Terzano, S., Clementi, F., & Fornasari,

D. (2000). Transcriptional regulation of the human alpha5 nicotinic receptor subunit

gene in neuronal and non-neuronal tissues. European Journal of Pharmacology,

393(1-3), 85–95. Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/10771001

Gabriel, S. B. (2002). The Structure of Haplotype Blocks in the Human Genome.

Science, 296(5576), 2225–2229. doi:10.1126/science.1069424

Gaiteri, C., Ding, Y., French, B., Tseng, G. C., & Sibille, E. (2014). Beyond modules and

hubs: the potential of gene coexpression networks for investigating molecular

mechanisms of complex brain disorders. Genes, Brain, and Behavior, 13(1), 13–24.

doi:10.1111/gbb.12106

Gamazon, E. R., Wheeler, H. E., Shah, K. P., Mozaffari, S. V, Aquino-Michaels, K.,

Carroll, R. J., … Im, H. K. (2015). A gene-based association method for mapping

traits using reference transcriptome data. Nature Genetics, 47(9), 1091–1098.

doi:10.1038/ng.3367

Grimaldi, M., Visintainer, R., & Jurman, G. (2011). RegnANN: Reverse Engineering

Gene Networks using Artificial Neural Networks. PloS One, 6(12), e28646.

doi:10.1371/journal.pone.0028646

Gusev, A., Ko, A., Shi, H., Bhatia, G., Chung, W., Penninx, B. W. J. H., … Pasaniuc, B.

(2016). Integrative approaches for large-scale transcriptome-wide association

studies. Nature Genetics, 48(3), 245–52. doi:10.1038/ng.3506

Halley, P., Kadakkuzha, B. M., Faghihi, M. A., Magistri, M., Zeier, Z., Khorkova, O., …

Wahlestedt, C. (2014). Regulation of the apolipoprotein gene cluster by a long

noncoding RNA. Cell Reports, 6(1), 222–30. doi:10.1016/j.celrep.2013.12.015

Hao, Y., Wu, W., Shi, F., Dalmolin, R. J. S., Yan, M., Tian, F., … Cao, W. (2015).

Prediction of long noncoding RNA functions with co-expression network in

esophageal squamous cell carcinoma. BMC Cancer, 15, 168. doi:10.1186/s12885-

015-1179-z

Heidari, N., Phanstiel, D. H., He, C., Grubert, F., Jahanbani, F., Kasowski, M., …

Snyder, M. P. (2014). Genome-wide map of regulatory interactions in the human

genome. Genome Research, 24(12), 1905–17. doi:10.1101/gr.176586.114

Hesler, C. B., Swenson, T. L., & Tall, A. R. (1987). Purification and characterization of a

human plasma cholesteryl ester transfer protein. The Journal of Biological

Chemistry, 262(5), 2275–82. Retrieved from

http://www.ncbi.nlm.nih.gov/pubmed/3818596

Hiltunen, M. O., Tuomisto, T. T., Niemi, M., Bräsen, J. H., Rissanen, T. T., Törönen, P.,

… Ylä-Herttuala, S. (2002). Changes in gene expression in atherosclerotic plaques

analyzed using DNA array. Atherosclerosis, 165(1), 23–32. doi:10.1016/S0021-

9150(02)00187-9

Hindorff, L. A., Sethupathy, P., Junkins, H. A., Ramos, E. M., Mehta, J. P., Collins, F. S.,

& Manolio, T. A. (2009). Potential etiologic and functional implications of genome-

wide association loci for human diseases and traits. Proceedings of the National

Academy of Sciences of the United States of America, 106, 9362–9367.

doi:10.1073/pnas.0903103106

Inazu, A., Brown, M. L., Hesler, C. B., Agellon, L. B., Koizumi, J., Takata, K., … Tall,

A. R. (1990). Increased high-density lipoprotein levels caused by a common

cholesteryl-ester transfer protein gene mutation. The New England Journal of

Medicine, 323(18), 1234–8. doi:10.1056/NEJM199011013231803

Ito, K., Ishikawa, F., Kanno, T., Ishikawa, Y., Akasaka, Y., Akishima, Y., … Mori, S.

(2004). Expression of cholesteryl ester transfer protein (CETP) in germinal centre B

cells and their neoplastic counterparts. Histopathology, 45(1), 73–81.

doi:10.1111/j.1365-2559.2004.01905.x

Joehanes, R., Ying, S., Huan, T., Johnson, A. D., Raghavachari, N., Wang, R., …

Munson, P. J. (2013). Gene expression signatures of coronary heart disease.

Arteriosclerosis, Thrombosis, and Vascular Biology, 33(6), 1418–26.

doi:10.1161/ATVBAHA.112.301169

Johnson, A. D., O’Donnell, C. J., Chanock, S., Manolio, T., Boehnke, M., Boerwinkle,

E., … Bakker, P. de. (2009). An Open Access Database of Genome-wide

Association Results. BMC Medical Genetics, 10(1), 6. doi:10.1186/1471-2350-10-6

Kaakinen, M., Ducci, F., Sillanpää, M. J., Läärä, E., & Järvelin, M.-R. (2012).

Associations between variation in CHRNA5-CHRNA3-CHRNB4, body mass index

and blood pressure in the Northern Finland Birth Cohort 1966. PloS One, 7(9),

e46557. doi:10.1371/journal.pone.0046557

Kaminsky, Z. A., Tang, T., Wang, S.-C., Ptak, C., Oh, G. H. T., Wong, A. H. C., …

Petronis, A. (2009). DNA methylation profiles in monozygotic and dizygotic twins.

Nature Genetics, 41(2), 240–5. doi:10.1038/ng.286

Kapiotis, S., Hermann, M., Exner, M., Laggner, H., & Gmeiner, B. M. K. (2005). Copper- and magnesium protoporphyrin complexes inhibit oxidative modification of

LDL induced by hemin, transition metal ions and tyrosyl radicals. Free Radical

Research, 39(11), 1193–202. doi:10.1080/10715760500138981

Kindermann, R., & Snell, J. L. (1980). Markov Random Fields and Their Applications

(Vol. 1). Providence, Rhode Island: American Mathematical Society.

doi:10.1090/conm/001

Kuivenhoven, J. A., Jukema, J. W., Zwinderman, A. H., de Knijff, P., McPherson, R.,

Bruschke, A. V, … Kastelein, J. J. (1998). The role of a common variant of the

cholesteryl ester transfer protein gene in the progression of coronary atherosclerosis.

The Regression Growth Evaluation Statin Study Group. The New England Journal

of Medicine, 338(2), 86–93. doi:10.1056/NEJM199801083380203

Kuryatov, A., Berrettini, W., & Lindstrom, J. (2011). Acetylcholine receptor (AChR) α5

subunit variant associated with risk for nicotine dependence and lung cancer reduces

(α4β2)ⁿα5 AChR function. Molecular Pharmacology, 79(1), 119–25.

doi:10.1124/mol.110.066357

Lee, J. Y., Yoo, S. S., Kang, H.-G., Jin, G., Bae, E. Y., Choi, Y. Y., … Park, J. Y. (2012).

A functional polymorphism in the CHRNA3 gene and risk of chronic obstructive

pulmonary disease in a Korean population. Journal of Korean Medical Science,

27(12), 1536–40. doi:10.3346/jkms.2012.27.12.1536

Leusink, M., Onland-Moret, N. C., Asselbergs, F. W., Ding, B., Kotti, S., van Zuydam,

N. R., … Maitland-van der Zee, A. H. (2014). Cholesteryl ester transfer protein

polymorphisms, statin use, and their impact on cholesterol levels and cardiovascular

events. Clinical Pharmacology and Therapeutics, 95(3), 314–20. doi:10.1038/clpt.2013.194

Li, C., Qian, W., Maclean, C. J., Zhang, J., Visser, J. A. de, Krug, J., … Hofacker, I. L.

(2016). The fitness landscape of a tRNA gene. Science (New York, N.Y.), 352(6287),

837–40. doi:10.1126/science.aae0568

Li, G., Ruan, X., Auerbach, R. K., Sandhu, K. S., Zheng, M., Wang, P., … Ruan, Y.

(2012). Extensive promoter-centered chromatin interactions provide a topological

basis for transcription regulation. Cell, 148(1-2), 84–98.

doi:10.1016/j.cell.2011.12.014

Lippert, C., Listgarten, J., Davidson, R. I., Baxter, S., Poon, H., Poong, H., …

Heckerman, D. (2013). An exhaustive epistatic SNP association analysis on

expanded Wellcome Trust data. Scientific Reports, 3, 1099. doi:10.1038/srep01099

Liu, J. Z., Tozzi, F., Waterworth, D. M., Pillai, S. G., Muglia, P., Middleton, L., …

Marchini, J. (2010). Meta-analysis and imputation refines the association of 15q25

with smoking quantity. Nature Genetics, 42(5), 436–40. doi:10.1038/ng.572

Lonsdale, J., Thomas, J., Salvatore, M., Phillips, R., Lo, E., Shad, S., … Moore, H. F.

(2013). The Genotype-Tissue Expression (GTEx) project. Nature Genetics, 45(6),

580–5. doi:10.1038/ng.2653

Maher, B. (2008). Personal genomes: The case of the missing heritability. Nature,

456(7218), 18–21. doi:10.1038/456018a

Mann, D., Reynolds, K., Smith, D., & Muntner, P. (2008). Trends in statin use and low-

density lipoprotein cholesterol levels among US adults: impact of the 2001 National

Cholesterol Education Program guidelines. The Annals of Pharmacotherapy, 42(9),

1208–15. doi:10.1345/aph.1L181

Margolin, A. A., Nemenman, I., Basso, K., Wiggins, C., Stolovitzky, G., Dalla Favera,

R., & Califano, A. (2006). ARACNE: an algorithm for the reconstruction of gene

regulatory networks in a mammalian cellular context. BMC Bioinformatics, 7 Suppl

1(Suppl 1), S7. doi:10.1186/1471-2105-7-S1-S7

Maurano, M. T., Haugen, E., Sandstrom, R., Vierstra, J., Shafer, A., Kaul, R., &

Stamatoyannopoulos, J. A. (2015). Large-scale identification of sequence variants

influencing human transcription factor occupancy in vivo. Nature Genetics, 47(12),

1393–401. doi:10.1038/ng.3432

Minică, C. C., Mbarek, H., Pool, R., Dolan, C. V, Boomsma, D. I., & Vink, J. M. (2016).

Pathways to smoking behaviours: biological insights from the Tobacco and Genetics

Consortium meta-analysis. Molecular Psychiatry. doi:10.1038/mp.2016.20

Moore, J. H., & Williams, S. M. (2005). Traversing the conceptual divide between

biological and statistical epistasis: systems biology and a more modern synthesis.

BioEssays : News and Reviews in Molecular, Cellular and Developmental Biology,

27(6), 637–46. doi:10.1002/bies.20236

Nagano, M., Yamashita, S., Hirano, K.-I., Takano, M., Maruyama, T., Ishihara, M., …

Matsuzawa, Y. (2004). Molecular mechanisms of cholesteryl ester transfer protein

deficiency in Japanese. Journal of Atherosclerosis and Thrombosis, 11(3), 110–21.

Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/15256762

Nanni, L., Romualdi, C., Maseri, A., & Lanfranchi, G. (2006). Differential gene

expression profiling in genetic and multifactorial cardiovascular diseases. Journal of

Molecular and Cellular Cardiology, 41(6), 934–48.

doi:10.1016/j.yjmcc.2006.08.009

Nikpay, M., Goel, A., Won, H.-H., Hall, L. M., Willenborg, C., Kanoni, S., … Farrall, M.

(2015). A comprehensive 1000 Genomes–based genome-wide association meta-

analysis of coronary artery disease. Nature Genetics, 47(10), 1121–1130.

doi:10.1038/ng.3396

Papp, A. C., Pinsonneault, J. K., Wang, D., Newman, L. C., Gong, Y., Johnson, J. A., …

Sadee, W. (2012). Cholesteryl Ester Transfer Protein (CETP) polymorphisms affect

mRNA splicing, HDL levels, and sex-dependent cardiovascular risk. PloS One, 7(3),

e31930. doi:10.1371/journal.pone.0031930

Perry, D. C., Xiao, Y., Nguyen, H. N., Musachio, J. L., Dávila-García, M. I., & Kellar, K.

J. (2002). Measuring nicotinic receptors with characteristics of alpha4beta2,

alpha3beta2 and alpha3beta4 subtypes in rat tissues by autoradiography. Journal of

Neurochemistry, 82(3), 468–81. Retrieved from

http://www.ncbi.nlm.nih.gov/pubmed/12153472

Radovica, I., Fridmanis, D., Vaivade, I., Nikitina-Zake, L., Klovins, J., Gordon, D., …

Cohen, J. (2013). The Association of Common SNPs and Haplotypes in CETP Gene

with HDL Cholesterol Levels in Latvian Population. PLoS ONE, 8(5), e64191.

doi:10.1371/journal.pone.0064191

Ramsay, J. E., Rhodes, C. H., Thirtamara-Rajamani, K., & Smith, R. M. (2015). Genetic

influences on nicotinic α5 receptor (CHRNA5) CpG methylation and mRNA

expression in brain and adipose tissue. Genes and Environment, 37(1), 14.

doi:10.1186/s41021-015-0020-x

Randi, A. M., Biguzzi, E., Falciani, F., Merlini, P., Blakemore, S., Bramucci, E., …

Mannucci, P. M. (2003). Identification of differentially expressed genes in coronary atherosclerotic plaques from patients with stable or unstable angina by cDNA array

analysis. Journal of Thrombosis and Haemostasis : JTH, 1(4), 829–35. Retrieved

from http://www.ncbi.nlm.nih.gov/pubmed/12871422

Regieli, J. J., Jukema, J. W., Grobbee, D. E., Kastelein, J. J. P., Kuivenhoven, J. A.,

Zwinderman, A. H., … Doevendans, P. A. (2008). CETP genotype predicts

increased mortality in statin-treated men with proven cardiovascular disease: an

adverse pharmacogenetic interaction. European Heart Journal, 29(22), 2792–9.

doi:10.1093/eurheartj/ehn465

Rempala, G. A., & Seweryn, M. (2013). Methods for diversity and overlap analysis in T-

cell receptor populations. Journal of Mathematical Biology, 67(6-7), 1339–68.

doi:10.1007/s00285-012-0589-7

Ritchie, M. D. (2011). Using biological knowledge to uncover the mystery in the search

for epistasis in genome-wide association studies. Annals of Human Genetics, 75(1),

172–82. doi:10.1111/j.1469-1809.2010.00630.x

Roberts, R. (2014). Genetics of coronary artery disease: an update. Methodist DeBakey

Cardiovascular Journal, 10(1), 7–12. Retrieved from

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=4051327&tool=pmcentr

ez&rendertype=abstract

Rönnegård, L., & Valdar, W. (2012). Recent developments in statistical methods for detecting genetic

loci affecting phenotypic variability. BMC Genetics, 13(1), 63. doi:10.1186/1471-2156-13-63

Sadee, W., Hartmann, K., Seweryn, M., Pietrzak, M., Handelman, S. K., & Rempala, G. A. (2014). Missing heritability of common diseases and treatments outside the

protein-coding exome. Human Genetics, 133(10), 1199–215. doi:10.1007/s00439-014-1476-7

Sanyal, A., Lajoie, B. R., Jain, G., & Dekker, J. (2012). The long-range interaction

landscape of gene promoters. Nature, 489(7414), 109–13. doi:10.1038/nature11279

Schadt, E. E. (2009). Molecular networks as sensors and drivers of common human

diseases. Nature, 461(7261), 218–23. doi:10.1038/nature08454

Schierer, A., Been, L. F., Ralhan, S., Wander, G. S., Aston, C. E., & Sanghera, D. K.

(2012). Genetic variation in cholesterol ester transfer protein, serum CETP activity,

and coronary artery disease risk in Asian Indian diabetic cohort. Pharmacogenetics

and Genomics, 22(2), 95–104. doi:10.1097/FPC.0b013e32834dc9ef

Schlaepfer, I. R., Hoft, N. R., & Ehringer, M. A. (2008). The genetic components of

alcohol and nicotine co-addiction: from genes to behavior. Current Drug Abuse

Reviews, 1(2), 124–34. Retrieved from

http://www.pubmedcentral.nih.gov/articlerender.fcgi?artid=2600802&tool=pmcentr

ez&rendertype=abstract

Schwartz, G. G., Olsson, A. G., Abt, M., Ballantyne, C. M., Barter, P. J., Brumm, J., …

Wright, R. S. (2012). Effects of dalcetrapib in patients with a recent acute coronary

syndrome. The New England Journal of Medicine, 367(22), 2089–99.

doi:10.1056/NEJMoa1206797

Seidah, N. G., Awan, Z., Chrétien, M., & Mbikay, M. (2014). PCSK9: a key modulator

of cardiovascular health. Circulation Research, 114(6), 1022–36.

doi:10.1161/CIRCRESAHA.114.301621

Senner, N. R., Conklin, J. R., & Piersma, T. (2015). An ontogenetic perspective on

individual differences. Proceedings. Biological Sciences / The Royal Society,

282(1814). doi:10.1098/rspb.2015.1050

Seo, D., Wang, T., Dressman, H., Herderick, E. E., Iversen, E. S., Dong, C., …

Goldschmidt-Clermont, P. J. (2004). Gene expression phenotypes of atherosclerosis.

Arteriosclerosis, Thrombosis, and Vascular Biology, 24(10), 1922–7.

doi:10.1161/01.ATV.0000141358.65242.1f

Sinnaeve, P. R., Donahue, M. P., Grass, P., Seo, D., Vonderscher, J., Chibout, S.-D., …

Granger, C. B. (2009). Gene expression patterns in peripheral blood correlate with

the extent of coronary artery disease. PloS One, 4(9), e7037.

doi:10.1371/journal.pone.0007037

Smith, R. M., Alachkar, H., Papp, A. C., Wang, D., Mash, D. C., Wang, J.-C., … Sadee,

W. (2011). Nicotinic α5 receptor subunit mRNA expression is associated with

distant 5’ upstream polymorphisms. European Journal of Human Genetics : EJHG,

19(1), 76–83. doi:10.1038/ejhg.2010.120

Stephens, S. H., Hartz, S. M., Hoft, N. R., Saccone, N. L., Corley, R. C., Hewitt, J. K., …

Ehringer, M. A. (2013). Distinct loci in the CHRNA5/CHRNA3/CHRNB4 gene

cluster are associated with onset of regular smoking. Genetic Epidemiology, 37(8),

846–59. doi:10.1002/gepi.21760

Stergachis, A. B., Neph, S., Sandstrom, R., Haugen, E., Reynolds, A. P., Zhang, M., …

Stamatoyannopoulos, J. A. (2014). Conservation of trans-acting circuitry during

mammalian regulatory evolution. Nature, 515(7527), 365–370.

doi:10.1038/nature13972

Stuiver, M., Lainez, S., Will, C., Terryn, S., Günzel, D., Debaix, H., … Müller, D.

(2011). CNNM2, encoding a basolateral protein required for renal Mg2+ handling,

is mutated in dominant hypomagnesemia. American Journal of Human Genetics,

88(3), 333–43. doi:10.1016/j.ajhg.2011.02.005

Sudmant, P. H., Rausch, T., Gardner, E. J., Handsaker, R. E., Abyzov, A., Huddleston, J.,

… Korbel, J. O. (2015). An integrated map of structural variation in 2,504 human

genomes. Nature, 526(7571), 75–81. doi:10.1038/nature15394

Sueda, S., Fukuda, H., Watanabe, K., Suzuki, J., Saeki, H., Ohtani, T., & Uraoka, T.

(2001). Magnesium deficiency in patients with recent myocardial infarction and

provoked coronary artery spasm. Japanese Circulation Journal, 65(7), 643–8.

Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/11446499

Suhy, A., Hartmann, K., Newman, L., Papp, A., Toneff, T., Hook, V., & Sadee, W.

(2014). Genetic variants affecting alternative splicing of human cholesteryl ester

transfer protein. Biochemical and Biophysical Research Communications, 443(4),

1270–4. doi:10.1016/j.bbrc.2013.12.127

Suhy, A., Hartmann, K., Papp, A. C., Wang, D., & Sadee, W. (2015). Regulation of

cholesteryl ester transfer protein expression by upstream polymorphisms: reduced

expression associated with rs247616. Pharmacogenetics and Genomics.

doi:10.1097/FPC.0000000000000151

Sun, X., Lu, Q., Mukherjee, S., Crane, P. K., Elston, R., & Ritchie, M. D.

(2014). Analysis pipeline for the epistasis search - statistical versus biological

filtering. Frontiers in Genetics, 5, 106. doi:10.3389/fgene.2014.00106

Tada, H., Won, H.-H., Melander, O., Yang, J., Peloso, G. M., & Kathiresan, S. (2014).

Multiple associated variants increase the heritability explained for plasma lipids and

coronary artery disease. Circulation. Cardiovascular Genetics, 7(5), 583–7.

doi:10.1161/CIRCGENETICS.113.000420

Toll, L., Zaveri, N. T., Polgar, W. E., Jiang, F., Khroyan, T. V, Zhou, W., … Leslie, F.

M. (2012). AT-1001: a high affinity and selective α3β4 nicotinic acetylcholine

receptor antagonist blocks nicotine self-administration in rats.

Neuropsychopharmacology : Official Publication of the American College of

Neuropsychopharmacology, 37(6), 1367–76. doi:10.1038/npp.2011.322

Torkamani, A., Topol, E. J., & Schork, N. J. (2008). Pathway analysis of seven common

diseases assessed by genome-wide association. Genomics, 92(5), 265–72.

doi:10.1016/j.ygeno.2008.07.011

Tyndale, R. F., Zhu, A. Z. X., George, T. P., Cinciripini, P., Hawk, L. W., Schnoll, R. A.,

… Lerman, C. (2015). Lack of Associations of CHRNA5-A3-B4 Genetic Variants

with Smoking Cessation Treatment Outcomes in Caucasian Smokers despite

Associations with Baseline Smoking. PloS One, 10(5), e0128109.

doi:10.1371/journal.pone.0128109

Van Eck, M., Ye, D., Hildebrand, R. B., Kar Kruijt, J., de Haan, W., Hoekstra, M., …

Van Berkel, T. J. C. (2007). Important Role for Bone Marrow-Derived Cholesteryl

Ester Transfer Protein in Lipoprotein Cholesterol Redistribution and Atherosclerotic Lesion Development in LDL Receptor Knockout Mice. Circulation Research,

100(5), 678–685. doi:10.1161/01.RES.0000260202.79927.4f

van Erven, T., & Harremoes, P. (2014). Rényi Divergence and Kullback-Leibler

Divergence. IEEE Transactions on Information Theory, 60(7), 3797–3820.

doi:10.1109/TIT.2014.2320500

Wang, D., Poi, M. J., Sun, X., Gaedigk, A., Leeder, J. S., & Sadee, W. (2014). Common

CYP2D6 polymorphisms affecting alternative splicing and transcription: long-range

haplotypes with two regulatory variants modulate CYP2D6 activity. Human

Molecular Genetics, 23(1), 268–78. doi:10.1093/hmg/ddt417

Weiss, R. B., Baker, T. B., Cannon, D. S., von Niederhausern, A., Dunn, D. M.,

Matsunami, N., … Leppert, M. F. (2008). A candidate gene approach identifies the

CHRNA5-A3-B4 region as a risk factor for age-dependent nicotine addiction. PLoS

Genetics, 4(7), e1000125. doi:10.1371/journal.pgen.1000125

Welter, D., MacArthur, J., Morales, J., Burdett, T., Hall, P., Junkins, H., … Parkinson, H.

(2014). The NHGRI GWAS Catalog, a curated resource of SNP-trait associations.

Nucleic Acids Research, 42(Database issue), D1001–6. doi:10.1093/nar/gkt1229

Whalen, S., Truty, R. M., & Pollard, K. S. (2016). Enhancer-promoter interactions are

encoded by complex genomic signatures on looping chromatin. Nature Genetics,

48(5), 488–496. doi:10.1038/ng.3539

Willer, C. J., Schmidt, E. M., Sengupta, S., Peloso, G. M., Gustafsson, S., Kanoni, S., …

Abecasis, G. R. (2013). Discovery and refinement of loci associated with lipid

levels. Nature Genetics, 45(11), 1274–1283. doi:10.1038/ng.2797

Willyard, C. (2015). Pharmacotherapy: Quest for the quitting pill. Nature, 522(7557),

S53–5. doi:10.1038/522S53a

Wingrove, J. A., Daniels, S. E., Sehnert, A. J., Tingley, W., Elashoff, M. R., Rosenberg,

S., … Kraus, W. E. (2008). Correlation of peripheral-blood gene expression with the

extent of coronary artery stenosis. Circulation. Cardiovascular Genetics, 1(1), 31–8.

doi:10.1161/CIRCGENETICS.108.782730

Won, H.-H., Natarajan, P., Dobbyn, A., Jordan, D. M., Roussos, P., Lage, K., … Tatar,

D. (2015). Disproportionate Contributions of Select Genomic Compartments and

Cell Types to Genetic Risk for Coronary Artery Disease. PLOS Genetics, 11(10),

e1005622. doi:10.1371/journal.pgen.1005622

Wu, C., Hu, Z., Yu, D., Huang, L., Jin, G., Liang, J., … Shen, H. (2009). Genetic variants

on chromosome 15q25 associated with lung cancer risk in Chinese populations.

Cancer Research, 69(12), 5065–72. doi:10.1158/0008-5472.CAN-09-0081

Wu, M.-Y., Zhang, X.-F., Dai, D.-Q., Ou-Yang, L., Zhu, Y., & Yan, H. (2016).

Regularized logistic regression with network-based pairwise interaction for

biomarker identification in breast cancer. BMC Bioinformatics, 17(1), 108.

doi:10.1186/s12859-016-0951-7

Zdravkovic, S., Wienke, A., Pedersen, N. L., Marenberg, M. E., Yashin, A. I., & De

Faire, U. (2002). Heritability of death from coronary heart disease: a 36-year follow-

up of 20 966 Swedish twins. Journal of Internal Medicine, 252(3), 247–54.

Retrieved from http://www.ncbi.nlm.nih.gov/pubmed/12270005

Zhang, Y., Wong, C.-H., Birnbaum, R. Y., Li, G., Favaro, R., Ngan, C. Y., … Wei, C.-L.

(2013). Chromatin connectivity maps reveal dynamic promoter-enhancer long-range associations. Nature, 504(7479), 306–10. doi:10.1038/nature12716

Zhao, Y., Luo, H., Chen, X., Xiao, Y., & Chen, R. (2014). Computational methods to

predict long noncoding RNA functions based on co-expression network. Methods in

Molecular Biology (Clifton, N.J.), 1182, 209–18. doi:10.1007/978-1-4939-1062-5_19

Appendix A: Ethics Approval

Our access of CATHGEN, FHS, and GTEx was approved by the Ohio State

University IRB (Protocol #2013H0096).

Appendix B: Funding Sources

This work was supported by the National Institute of General Medical Sciences

(U01GM092655), the National Center for Advancing Translational Sciences (TL1

TR001069), the US National Science Foundation (DMS-1318886), the US National

Cancer Institute (R01-CA152158) and the CCTS (Center for Clinical and Translational

Science) at OSU under Award Number UL1RR025755 from the National Center For

Research Resources.

CATHGEN

For CATHGEN, clinical data originated from the Duke Databank for

Cardiovascular Disease (DDCD) and biological samples originated from the Duke

Cardiac CATHeterization (CATHGEN) study. Funding support for the Genetic

Mediators of Metabolic CVD Risk was provided by NHLBI grant RC2 HL101621

(William E. Kraus). The data used for the analyses described in this manuscript were obtained from the dbGaP accession number phs000703.v1.p1 on 25 March 2015.

The Framingham Heart Study

The Framingham Heart Study is conducted and supported by the National Heart,

Lung, and Blood Institute (NHLBI) in collaboration with Boston University (Contract No. N01-HC-25195). This manuscript was not prepared in collaboration with investigators of the Framingham Heart Study and does not necessarily reflect the opinions or views of the Framingham Heart Study, Boston University, or NHLBI.

Additional funding for SABRe was provided by Division of Intramural Research,

NHLBI, and Center for Population Studies, NHLBI. Funding for SHARe Affymetrix genotyping was provided by NHLBI Contract N02-HL-64278. SHARe Illumina genotyping was provided under an agreement between Illumina and Boston University.

The data used for the analyses described in this manuscript were obtained from the dbGaP accession number phs000007.v23.p1 on 21 July 2014.

The Genetic Architecture of Smoking and Smoking Cessation

The Genetic Architecture of Smoking and Smoking Cessation data were accessed through dbGaP. Genotyping, performed at CIDR, was funded by 1 X01HG005274-01.

CIDR is fully funded through a federal contract from the National Institutes of Health to

The Johns Hopkins University, contract number HHSN268200782096C. Assistance with genotype cleaning, as well as with general study coordination, was provided by the

GENEVA Coordinating Center (U01HG004446). Funding support for collection of datasets and samples was provided by COGEND (P01 CA089392) and UW-TTURC

(P50 DA019706, P50 CA084724).

The Genotype-Tissue Expression Project

The Genotype-Tissue Expression (GTEx) Project was supported by the Common

Fund of the Office of the Director of the National Institutes of Health. Additional funds were provided by the NCI, NHGRI, NHLBI, NIDA, NIMH, and NINDS. Donors were enrolled at Biospecimen Source Sites funded by NCI\SAIC-Frederick, Inc. (SAIC-F) subcontracts to the National Disease Research Interchange (10XS170), Roswell Park

Cancer Institute (10XS171), and Science Care, Inc. (X10S172). The Laboratory, Data

Analysis, and Coordinating Center (LDACC) was funded through a contract

(HHSN268201000029C) to The Broad Institute, Inc. Biorepository operations were funded through an SAIC-F subcontract to Van Andel Institute (10ST1035). Additional data repository and project management were provided by SAIC-F

(HHSN261200800001E). The Brain Bank was supported by supplements to University of Miami grants DA006227 & DA033684 and to contract N01MH000028. Statistical

Methods development grants were made to the University of Geneva (MH090941 &

MH101814), the University of Chicago (MH090951, MH090937, MH101820,

MH101825), the University of North Carolina - Chapel Hill (MH090936 & MH101819),

Harvard University (MH090948), Stanford University (MH101782), Washington

University St Louis (MH101810), and the University of Pennsylvania (MH101822). The data used for the analyses described in this manuscript were obtained from: the GTEx

Portal in May 2015 and dbGaP accession number phs000424.v5.p1 in April 2014.

Ohio Supercomputer Center

These analyses were made possible by computing time provided by the Ohio

Supercomputer Center, GRANT #: PAS0885-2 PROJECT: COLLABORATION

ENVIRONMENT FOR PHARMACOGENOMICS.

SAGE

Funding for SAGE was provided through the NIH Genes, Environment and

Health Initiative [GEI] (U01 HG004422). SAGE is a GWAS funded as part of the Gene

Environment Association Studies (GENEVA) under GEI. Assistance with phenotype harmonization and genotype cleaning, as well as with general study coordination, was provided by the GENEVA Coordinating Center (U01 HG004446). Assistance with data cleaning was provided by the National Center for Biotechnology Information. Support for collection of datasets and samples was provided by COGA (U10 AA008401), COGEND (P01 CA089392), and FSCD (R01 DA013423). Genotyping at the Johns Hopkins

University Center for Inherited Disease Research (CIDR) was funded by NIH GEI

(U01HG004438), the National Institute on Alcohol Abuse and Alcoholism, NIDA, and the NIH contract High throughput genotyping for studying the genetic contributions to human disease (HHSN268200782096C).

Appendix C: Availability of data and materials

The datasets supporting the conclusions of this article are available in the dbGaP repository:

phs000703 ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000703 (CATHGEN)

phs000007 ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000007 (FHS)

phs000424 ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000424 (GTEx)

phs000092 ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000092 (SAGE)

phs000404 ftp://ftp.ncbi.nlm.nih.gov/dbgap/studies/phs000404 (Genetic Architecture)

Appendix D: Abbreviations

AEI allelic expression imbalance

CAD coronary artery disease

CATHGEN CATHeterization in GENetics

FHS Framingham Heart Study

GWAS genome wide association study

LD linkage disequilibrium

MI myocardial infarction

ND nicotine dependence

SAGE Study of Addiction: Genetics and Environment

SNP single nucleotide polymorphism

Appendix E: Supplemental Figures

Figure 31. Tissue specific expression of APOA1, APOC1, ALDH2, CYP17A1, and MFGE8. Average RPKM (y-axis) for each tissue type as reported by GTEx. Figure generated using the GTEx Browser.

Figure 32. Number of distinct signals associated with eQTLs. Correlation between beta and LD (R2) with the top eQTL for all significant eQTLs in the given gene x tissue combination. GWAS variant is noted in red. Gene x tissue combination is noted above each plot.

Figure 33. Linkage disequilibrium between LIPA eQTLs in whole blood reveals multiple distinct haplotypes. eQTLs are clustered by their linkage disequilibrium (R2). Colored bar at the top of the plot reflects the strength of the eQTL with more significant p-values appearing as a darker shade of red.

Figure 34. Distribution of p-values. Histograms of p-values for 5000 bootstrap replicates of conditional logistic regression (MI ~ gene expression) in the Framingham Heart Study for those genes determined as significant in CATHGEN that did not meet the criteria for being associated with MI in FHS. FURIN, IL6R, RAI1, and UBE2Z were classified as informative of MI based on right-skewed distribution and are shown in Figure 1. CNNM2, GUCY1A3, MRAS, MIA3, PDGFD, PEMT, and LPAL2 did not show evidence of differential expression given almost uniformly distributed p-values and are shown here.

Figure 35. LD of nicotinic receptor locus in YRI and CEU. Linkage disequilibrium (LD) structure for all SNPs in the 500kb (chr15:76507-76985 kb) region (available from HapMap v3 in Haploview (v4.2)) for people of European (1000 Genomes population CEU, A) and African descent (YRI, B). Boxes are shaded according to the standardized LD coefficient, D′, with red representing D′=1, and haplotype blocks are defined by the ‘confidence interval’ parameter (Gabriel, 2002).
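The two LD measures used in these captions, the standardized coefficient D′ and R2, can be illustrated with a minimal sketch. It assumes phased haplotype counts for two biallelic SNPs; the function name `ld_stats` and the example counts are hypothetical and are not taken from Haploview or from the data analyzed here.

```python
def ld_stats(n_AB, n_Ab, n_aB, n_ab):
    """Return (D, D_prime, r2) for two biallelic SNPs from phased haplotype counts.

    n_AB is the number of haplotypes carrying allele A at SNP 1 and allele B
    at SNP 2, and so on. All counts here are illustrative assumptions.
    """
    n = n_AB + n_Ab + n_aB + n_ab
    p_AB = n_AB / n
    p_A = (n_AB + n_Ab) / n      # frequency of allele A at SNP 1
    p_B = (n_AB + n_aB) / n      # frequency of allele B at SNP 2
    D = p_AB - p_A * p_B         # raw disequilibrium coefficient

    # The largest |D| attainable given the allele frequencies depends on the sign of D.
    if D >= 0:
        d_max = min(p_A * (1 - p_B), (1 - p_A) * p_B)
    else:
        d_max = min(p_A * p_B, (1 - p_A) * (1 - p_B))
    d_prime = abs(D) / d_max if d_max > 0 else 0.0

    denom = p_A * (1 - p_A) * p_B * (1 - p_B)
    r2 = D * D / denom if denom > 0 else 0.0
    return D, d_prime, r2


# Perfectly correlated SNPs: both measures equal 1.
print(ld_stats(50, 0, 0, 50))
# One haplotype class is absent but allele frequencies differ: D' = 1 while r2 < 1.
print(ld_stats(40, 10, 0, 50))
```

The second call shows why the two measures are reported separately: D′ reaches 1 whenever one of the four haplotypes is missing, whereas R2 reaches 1 only when the two SNPs also share the same allele frequencies.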

Figure 36. LD association with eQTL strength for CHRNA5, RP11-650L12.2, and CHRNB4. Correlation between -log (p value) and LD (R2) with the top eQTL haplotype: for CHRNA5 rs880395 in skeletal muscle (A. correlation r2=0.96) and nucleus accumbens (B. correlation r2=0.78); for RP11-650L12.2 rs880395 in skeletal muscle (C. correlation r2=0.78) and nucleus accumbens (D. correlation r2 = 0.66); for CHRNB4 rs67292883 in testis (E. correlation r2 = 0.55) and rs4886580 in esophagus (F. correlation r2 = 0.27). Diamonds highlight CHRNA5-enhancer rs880395 (gold), nonsynonymous CHRNA5 rs16969968 (green), CHRNA3-enhancer (brain) rs1948 (red), and a potential additional regulatory haplotype for CHRNA5, and RP11-650L12.2 in brain tagged by rs8042374 (blue), CHRNB4 rs67292883 (purple), and CHRNB4 rs4886580 (peach).

Figure 37. CHRNA3 eQTLs replicate in Braineac. Microarray data from braineac.org provide gene expression across ten brain regions. Plot shows the association between CHRNA3 mRNA expression and rs1948 in different brain regions (CRBL = Cerebellar cortex, FCTX = Frontal cortex, HIPP = hippocampus, MEDU = medulla, OCTX = Occipital cortex, PUTM = putamen, SNIG = substantia nigra, TCTX = temporal cortex, THAL = thalamus, WHMT = intralobular white matter).

Figure 38. Allelic expression ratios for nicotinic receptor locus. More sensitive than eQTL analysis, measurement of allelic ratios can reveal the presence of cis-acting regulatory variants. In genomic DNA, each allele is present in equal ratios, while deviation from unity in mRNA (allelic mRNA expression imbalance or AEI) reveals the presence of regulatory variants. For CHRNA3, CHRNA5, and CHRNB4, there is robust AEI, but not for PSMA4. Allelic ratios are plotted for all marker SNPs in PSMA4 (A) and CHRNA3 (B) in nucleus accumbens. SNPs with equal representation of the reference and alternate alleles have a ratio of 0.5 (reference/total), with deviations demonstrating AEI. CHRNA3 shows robust allelic ratios in nucleus accumbens measured at 3 different SNPs (B), indicating the presence of a strong cis-acting regulatory element. Although there are only three subjects and three different marker SNPs (rs1051730, rs3743075, rs8040868 all in strong LD) these samples are also heterozygous for rs1948, the top eQTL for CHRNA3 in the basal ganglia, supporting the rs1948-tagged haplotype as the regulatory element (Figure 23). Allelic ratios are also measured at rs16969968 (C) and rs1948 (D) across different tissues. Each dot represents the allelic ratio for an individual in the given tissue. In C, the color indicates the genotype for the CHRNA5-enhancer haplotype: red, green and blue are 0, 1 or 2 minor alleles respectively. In most tissues, significant AEI is accounted for by the CHRNA5-enhancer haplotype, but not in every tissue (colon and LCLs). In D, CHRNB4 mRNA exhibits evidence of AEI, at rs1948, in several tissues. Allelic ratios consistently below 0.4 in adrenal tissue suggest this SNP is in strong disequilibrium with a functional variant affecting CHRNB4 expression in adrenal tissue.
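The allelic ratio described in this caption (reference reads divided by total reads, with 0.5 expected in the absence of AEI) can be sketched as follows. The read counts are hypothetical, and the exact binomial test is an illustrative assumption about how a deviation from 0.5 might be assessed; it is not necessarily the procedure used to produce the figure.

```python
from math import comb


def allelic_ratio(ref_reads, alt_reads):
    """Reference-allele fraction (reference / total) at a heterozygous marker SNP."""
    total = ref_reads + alt_reads
    if total == 0:
        raise ValueError("no reads cover this SNP")
    return ref_reads / total


def binomial_two_sided_p(ref_reads, alt_reads, expected=0.5):
    """Exact two-sided binomial p-value against the expected reference fraction."""
    n = ref_reads + alt_reads
    pmf = [comb(n, k) * expected**k * (1 - expected) ** (n - k) for k in range(n + 1)]
    observed = pmf[ref_reads]
    # Sum the probability of every outcome at least as unlikely as the observed one.
    return min(1.0, sum(p for p in pmf if p <= observed * (1 + 1e-9)))


# Hypothetical counts: 30 reference and 70 alternate reads at one marker SNP.
ratio = allelic_ratio(30, 70)
p_value = binomial_two_sided_p(30, 70)
print(f"allelic ratio = {ratio:.2f}, p = {p_value:.2e}")
```

In practice the expected fraction is often calibrated against the ratio measured in genomic DNA rather than fixed at 0.5; the `expected` argument is included so that such a calibration can be substituted.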

Figure 39. LD between candidate SNPs in Africans with expression, regulatory, and clinical associations annotated. SNPs prioritized based on GWAS hits, eQTLs, allelic imbalance, and literature are grouped by their LD (R2) in the YRI population. Note that the different sources of evidence (GWAS, eQTL, etc.) are not YRI specific. The dendrogram at the top depicts the relatedness of groups of SNPs. Although the majority of GWAS focus on Caucasian populations, three studies specific to African populations (‘Genome-wide association study of coronary heart disease and its risk factors in 8,090 African Americans: the NHLBI CARe Project’, ‘Genome-wide association study identifies genetic variants influencing F-cell levels in sickle-cell patients’, and ‘Genome-wide meta-analyses of smoking behaviors in African Americans’) identified GWAS hits in this locus. These variants are among those also identified in Caucasians (rs1051730, rs8034191, rs667282, rs2036527, rs16969968).

Figure 40. Association between GWAS p-value and linkage disequilibrium (R2) with top scoring SNP indicates multiple functional haplotypes in CETP locus. Each plot represents a different study x phenotype combination. Each point represents a SNP in the CETP locus that was considered by the given study. The p-value for the trait of interest is plotted against the linkage disequilibrium of the SNP with the top scoring SNP (most significant p-value) in the study. Data were taken from GRASP-DB, which curates GWAS data.

Figure 41. Strong negative LD between rs247616 and rs6499863 observed in 2 different cohorts: GTEx and 1000 Genomes.

Appendix F: Supplemental Table

Table 10. Annotation and gene information for GWAS variants. The GWAS gene assignment is as reported in ‘A comprehensive 1000 Genomes–based genome-wide association meta-analysis of coronary artery disease’ (Nikpay et al., 2015). eGenes and tissues are derived from GTEx (Lonsdale et al., 2013). Brænne et al. reprioritized genes are taken from Supplementary Figure 7 of the 2015 paper entitled ‘Prediction of Causal Candidate Genes in Coronary Artery Disease Loci’ (Brænne et al., 2015). GWAS Catalog includes all traits associated with the SNP in the NHGRI-EBI GWAS Catalog. GO groupings are based on the clustering analysis performed (Figure 4).

Locus SNP GWAS Gene eGenes Tissues Braenne et Al. GWAS Catalog GO Number Assignment Reprioritized Group Gene

1 rs68170813 7q22 -

rs2891168 9p21 CDKN2A/B -

3 rs199884676 ABCG5- - 10 164 ABCG8

4 rs2519093 ABO ABO, SURF1 Adrenal Gland, - Venous 8, 9 Heart Atrial thromboembolism Appendage, Whole Blood

5 rs4468572 ADAMTS7 ADAMTS7, Artery Tibial, ADAMTS7, 3, 8 RP11-160C18.2 Muscle Skeletal WDR61

6 rs932344 ADTRP- ADTRP Whole Blood - 9 C6orf105

Continued 165

Table 10 continued

7 rs16986953 AK097927 - 8

8 rs3822921 ANKS1A - 4

9 rs398090189 APOB - 1

10 rs4420638 APOC1 TOMM40 Age-related macular 2 degeneration, Alzheimer's disease, Alzheimer's disease (age of onset), Alzheimer's disease

165 (late onset), Cholesterol, total,

Cognitive decline, C- reactive protein, C- reactive protein levels, HDL cholesterol, LDL cholesterol, Lipid traits, Lipoprotein- associated phospholipase A2 activity and mass, Longevity, Longevity (85 years and older), Longevity (90 years

165 Continued

Table 10 continued and older), Quantitative traits, Triglycerides, Verbal declarative memory

11 rs2681472 ATP2B1 ATP2B1, RP11- Artery Tibial, GALNT4 Diastolic blood 8, 10 981P6.1 Thyroid pressure, Hypertension

12 rs7212798 BCAS3 - 4

13 rs11617955 COL4A1/A2 IRS2 3

14 rs11838776 COL4A1/A2 IRS2 3 166

15 rs1870634 CXCL12 - 3

16 rs11191416 CYP17A1- MARCKSL1P1, Adipose NT5C2, 6, 7, CNNM2- TMEM180, Subcutaneous, CNNM2 8, 12 NT5C2 ARL3, Adipose Visceral MIR1307, Omentum, Adrenal CYP17A1-AS1, Gland, Artery NT5C2, Aorta, Cells WBP1L Transformed fibroblasts, Esophagus Mucosa, Muscle Skeletal, Nerve Tibial, Skin Not Sun Exposed 166 Continued

Table 10 continued Suprapubic, Skin Sun Exposed Lower leg, Testis, Thyroid

17 rs4593108 EDNRA - 12

18 rs1924981 FLT1 - 12

19 rs2521501 FURIN-FES FES, FURIN Adipose FES, FURIN, Blood pressure, 12 Subcutaneous, MAN2A2 Diastolic blood Adipose Visceral pressure,

167 Omentum, Artery Hypertension, Aorta, Artery Systolic blood

Tibial, Breast pressure Mammary Tissue, Cells Transformed fibroblasts, Esophagus Mucosa, Lung, Nerve Tibial, Pancreas, Skin Sun Exposed Lower leg, Thyroid

20 rs72689147 GUCY1A3 RP11-588K22.2 Nerve Tibial - 8, 9

21 rs2107595 HDAC9 - Coronary artery 11 disease or large artery Continued 167

Table 10 continued stroke, Large artery stroke, Stroke (ischemic)

22 rs10139550 HHIPL1 HHIPL1 Adipose YY1 8 Subcutaneous, Cells Transformed fibroblasts

23 rs6689306 IL6R IL6R, TDRD10 Colon Transverse, UBAP2L, 3, 8 Lung, Testis, IL6R, Whole Blood ATP8B2, CHTOP

168 24 rs28451064 KCNE2 - 12 (gene desert)

25 rs56336142 KCNK5 KCNK5 Adipose - 8 Subcutaneous

26 rs2487928 KIAA1462 KIAA1462 Artery Aorta KIAA1462 8

27 rs11830157 KSR2 - 12

28 rs56289821 LDLR SMARCA4 SMARCA4 1

29 rs1412444 LIPA LIPA Adipose LIPA Coronary heart 8 Subcutaneous, disease Adrenal Gland, Colon Transverse,

168 Continued

Table 10 continued Lung, Pancreas, Spleen, Thyroid, Whole Blood

30 rs55730499 LPA SLC22A3 Skin Sun Exposed SLC22A3, 2, 10 Lower leg AL591069.5

31 rs17411031 LPL LPL 1

32 rs8042271 MFGE8- RP11- Artery Aorta, - 3, 8 ABHD2 326A19.4, Artery Tibial MFGE8

33 rs67180937 MIA3 RP11-378J18.8, Adipose AIDA, MIA3, 6, 8, 169 AIDA, MIA3, Subcutaneous, C10rf58 9

FAM177B, Adipose Visceral TAF1A Omentum, Adrenal Gland, Artery Aorta, Artery Tibial, Brain Anterior cingulate cortex BA24, Brain Cerebellar Hemisphere, Brain Cerebellum, Breast Mammary Tissue, Cells Transformed fibroblasts, Colon Sigmoid, Continued 169

Table 10 continued Esophagus Gastroesophageal Junction, Esophagus Mucosa, Esophagus Muscularis, Lung, Muscle Skeletal, Nerve Tibial, Ovary, Skin Sun Exposed Lower leg, Spleen, Testis, 170 Thyroid, Whole

Blood

34 rs139016349 MRAS CEP70, 5 MRAS

35 rs3918226 NOS3 - 12

37 rs2128739 PDGFD - 8

38 rs9349379 PHACTR1 RP1-257A7.5, Artery Aorta, YN62D03 Cervical artery 6, 8 PHACTR1 Artery Coronary, dissection, Coronary Artery Tibial artery calcification, Coronary heart disease, Migraine, Continued 170

Table 10 continued Migraine - clinic- based, Migraine without aura

39 rs2315065 PLG LAPAL2, 2 PLG

40 rs663129 PMAIP1-MC4R - 4, 12

41 rs180803 POM121L9P-ADORA2A - 8, 12

42 rs9970807 PPAP2B - 12

43 rs7214245 RAI1-PEMT- TOM1L2, Muscle Skeletal, TOM1L2 4, 7, RASD1 AC122129.1 Pituitary 8, 12

44 rs17087335 REST-NOA1 RP11-738E22.3 Skin Sun Exposed REST 6, 8, Lower leg 11

45 rs11065979 SH2B3 ALDH2 Skin Sun Exposed C12orf30 6, 8 Lower leg

46 rs273909 SLC22A4- SLC22A5, Cells Transformed - 8, 10 SLC22A5 AC034220.3 fibroblasts, Muscle Skeletal, Nerve Tibial, Stomach, Thyroid


Table 10 continued 47 rs56062135 SMAD3 SMAD3 Thyroid - 11, 11

48 rs9914266 SMG6 SRR, SGSM2, Adipose - 4, 8, RP1-59D14.5, Subcutaneous, 12 AC090617.1 Adrenal Gland, Artery Tibial, Cells Transformed fibroblasts, Colon Transverse, Esophagus Mucosa, Esophagus

172 Muscularis, Lung, Skin Sun Exposed

Lower leg, Testis

49 rs7528419 SORT1 PSRC1, Esophagus CELSR2, Lipoprotein- 4, 12 CELSR2, Mucosa, Liver, SORT1, associated SORT1 Muscle Skeletal, PSRC1 phospholipase A2 Pancreas, Skin Sun activity and mass Exposed Lower leg, Testis, Whole Blood

50 rs10840293 SWAP70 SWAP70 Skin Not Sun AF075116, 12 Exposed SWAP70 Suprapubic, Spleen


51 rs2001846 TRIB1 - 4

52 rs35895680 UBE2Z ATP5G1, Brain Cerebellum, SAPG9, 4, 6, UBE2Z, SNF8 Heart Left ATP5G1, 7 Ventricle, Muscle UBE2Z, Skeletal, Skin Sun CALCOCO2, Exposed Lower GIP, DLX4 leg, Thyroid

53 rs7568458 VAMP5- GGCX, Adipose GGCX, 9, 12 VAMP8- VAMP8, Subcutaneous, VAMP8,

173 GGCX TMEM150A Adipose Visceral VAMP5 Omentum, Artery

Aorta, Artery Tibial, Brain Anterior cingulate cortex BA24, Brain Cerebellar Hemisphere, Brain Cerebellum, Brain Cortex, Brain Hypothalamus, Breast Mammary Tissue, Cells EBV-transformed lymphocytes, Cells Transformed fibroblasts, Colon Transverse, 173 Continued

Table 10 continued Esophagus Mucosa, Esophagus Muscularis, Lung, Muscle Skeletal, Nerve Tibial, Pancreas, Pituitary, Skin Not Sun Exposed Suprapubic, Skin Sun Exposed 174 Lower leg, Spleen,

Stomach, Testis, Thyroid, Whole Blood

54 rs201810558 WDR12 ICA1L, 6 WDR12, CARF, ALS2CR8, NBEAL1

55 rs11556924 ZC3HC1 ZC3HC1 Coronary artery 6 disease, Coronary artery disease or ischemic stroke, Coronary artery disease or large artery stroke, Coronary 174 Continued

Table 10 continued heart disease

56 rs17678683 ZEB2-AC074093.1 - 4

57 rs964184 ZNF259- - Blood metabolite 1, 4 APOA5- levels, Cholesterol, APOA1 total, Circulating phylloquinone levels, Coronary artery

175 disease or ischemic stroke, Coronary

artery disease or large artery stroke, Coronary heart disease, HDL cholesterol, Hypertriglyceridemia, LDL cholesterol, Lipoprotein- associated phospholipase A2 activity and mass, Metabolic syndrome, Metabolite levels, Phospholipid levels (plasma), Response to Continued 175

Table 10 continued Vitamin E supplementation, Triglycerides, Very long-chain saturated fatty acid levels (fatty acid 20:0), Vitamin E levels

58 rs12976411 ZNF507-LOC400684 - 4, 8


Table 11: GO Cluster Details. Most frequent GO terms (and corresponding genes) for each cluster.

GO Cluster #1: LDLR, APOA5, LPL, APOB, APOA1
GO Term | Genes
lipoprotein metabolic process | LDLR, APOA5, LPL, APOB, APOA1
protein binding | LDLR, APOA5, LPL, APOB, APOA1
small molecule metabolic process | LDLR, APOA5, LPL, APOB, APOA1
cholesterol homeostasis | LDLR, APOA5, LPL, APOB
extracellular region | LDLR, APOA5, LPL, APOA1
extracellular space | LDLR, APOA5, LPL, APOA1
phototransduction, visible light | LDLR, LPL, APOB, APOA1
plasma membrane | LDLR, LPL, APOB, APOA1
retinoid metabolic process | LDLR, LPL, APOB, APOA1
very-low-density lipoprotein particle | LDLR, APOA5, LPL, APOA1

GO Cluster #2: APOC1, LPA, PLG, PCSK9
GO Term | Genes
extracellular region | APOC1, LPA, PLG, PCSK9
apolipoprotein binding | LPA, PLG, PCSK9
lipoprotein metabolic process | APOC1, LPA, PLG
proteolysis | LPA, PLG, PCSK9
serine-type endopeptidase activity | LPA, PLG, PCSK9
apoptotic process | PLG, PCSK9
catalytic activity | LPA, PCSK9
cell surface | PLG, PCSK9
cholesterol metabolic process | APOC1, PLG
endoplasmic reticulum | APOC1, PLG
extracellular space | PLG, PCSK9
extrinsic to external side of plasma membrane | PLG, PCSK9
lipid metabolic process | APOC1, LPA
protein binding | PLG, PCSK9
triglyceride metabolic process | APOC1, PLG

GO Cluster #3: CXCL12, IL6R, ADAMTS7, MFGE8, COL4A1, COL4A2
GO Term | Genes
extracellular region | IL6R, ADAMTS7, MFGE8, COL4A1, COL4A2
protein binding | CXCL12, IL6R, ADAMTS7, COL4A1, COL4A2
extracellular matrix | CXCL12, IL6R, ADAMTS7, COL4A2
angiogenesis | IL6R, ADAMTS7, COL4A2
extracellular space | MFGE8, COL4A1, COL4A2
axon guidance | IL6R, ADAMTS7
basement membrane | IL6R, ADAMTS7
cell adhesion | MFGE8, COL4A2
cell surface | CXCL12, COL4A1
collagen | IL6R, ADAMTS7
collagen catabolic process | IL6R, ADAMTS7
collagen type IV | IL6R, ADAMTS7
endoplasmic reticulum lumen | IL6R, ADAMTS7
external side of plasma membrane | MFGE8, COL4A2
extracellular matrix disassembly | IL6R, ADAMTS7
extracellular matrix organization | IL6R, ADAMTS7
extracellular matrix structural constituent | IL6R, ADAMTS7
plasma membrane | MFGE8, COL4A1
positive regulation of cell proliferation | MFGE8, COL4A1

GO Cluster #4: ZNF507, ZEB2, TRIB1, PSRC1, PMAIP1, SNF8, SMG6, ZNF259, RAI1, ANKS1A, BCAS3
GO Term | Genes
nucleus | ZNF507, ZEB2, TRIB1, PMAIP1, SNF8, SMG6, ZNF259, RAI1, ANKS1A, BCAS3
cytoplasm | ZNF507, ZEB2, PSRC1, PMAIP1, SNF8, ZNF259, RAI1, ANKS1A
protein binding | ZNF507, ZEB2, TRIB1, PSRC1, PMAIP1, SNF8, SMG6, ANKS1A
cytosol | TRIB1, PSRC1, SNF8, SMG6
metal ion binding | SNF8, RAI1, BCAS3
nucleolus | SNF8, RAI1, ANKS1A
transcription, DNA-templated | SMG6, RAI1, BCAS3

GO Cluster #5: MRAS
GO Term | Genes
actin cytoskeleton organization | MRAS
GTP binding | MRAS
GTP catabolic process | MRAS
GTP-dependent protein binding | MRAS
GTPase activity | MRAS
intracellular | MRAS
intracellular protein transport | MRAS
membrane | MRAS
multicellular organismal development | MRAS
muscle organ development | MRAS
nucleocytoplasmic transport | MRAS
plasma membrane | MRAS
protein transport | MRAS
Ras protein signal transduction | MRAS
signal transduction | MRAS
small GTPase mediated signal transduction | MRAS

GO Cluster #6: NOA1, SH2B3, NT5C2, PHACTR1, UBE2Z, TAF1A, AIDA, WDR12, ZC3HC1, TCF21
GO Term | Genes
protein binding | NOA1, SH2B3, NT5C2, UBE2Z, TAF1A, AIDA, ZC3HC1, TCF21
nucleus | PHACTR1, AIDA, WDR12, ZC3HC1, TCF21
cytoplasm | NOA1, PHACTR1, WDR12
cytosol | NT5C2, PHACTR1, UBE2Z
apoptotic process | SH2B3, WDR12
nucleoplasm | TAF1A, ZC3HC1

GO Cluster #7: CYP17A1, ATP5G1, PEMT
GO Term | Genes
mitochondrion | CYP17A1, ATP5G1, PEMT
small molecule metabolic process | CYP17A1, ATP5G1, PEMT
endoplasmic reticulum membrane | ATP5G1, PEMT
integral to membrane | CYP17A1, PEMT

GO Cluster #8: HHIPL1, PDGFD, TOM1L2, SGSM2, RP11-160C18.2, RP11-326A19.4, RP11-378J18.8, RP11-588K22.2, RP11-738E22.3, RP11-981P6.1, RP1-257A7.5, RP1-59D14.5, POM121L9P, MIR1307, LOC400684, FAM177B, CYP17A1-AS1, AK097927, AC122129.1, AC090617.1, ABO, AC034220.3, KIAA1462, TDRD10, KCNK5, ALDH2, LIPA, ABHD2, CNNM2, TMEM180, WBP1L
GO Term | Genes
integral to membrane | HHIPL1, RP11-588K22.2, CNNM2, WBP1L
biological_process | HHIPL1, RP1-257A7.5
carbohydrate metabolic process | RP11-378J18.8, RP1-257A7.5
extracellular region | RP1-257A7.5, CYP17A1-AS1
membrane | RP1-257A7.5, CYP17A1-AS1
molecular_function | HHIPL1, RP1-257A7.5

GO Cluster #9: VAMP5, MIA3, SURF1, GUCY1A3, GGCX, TMEM150A, ADTRP, C6orf105
GO Term | Genes
integral to membrane | VAMP5, MIA3, SURF1, GUCY1A3, GGCX, TMEM150A, ADTRP, C6orf105
plasma membrane | VAMP5, MIA3, ADTRP, C6orf105
blood coagulation | SURF1, GUCY1A3
endoplasmic reticulum membrane | SURF1, GGCX
membrane | SURF1, TMEM150A
protein binding | GGCX, TMEM150A

GO Cluster #10: ABCG5, ABCG8, SLC22A5, SLC22A4, SLC22A3, ATP2B1
GO Term | Genes
integral to membrane | ABCG5, ABCG8, SLC22A5, SLC22A4, SLC22A3, ATP2B1
plasma membrane | ABCG5, ABCG8, SLC22A5, SLC22A4, SLC22A3, ATP2B1
protein binding | ABCG5, ABCG8, SLC22A5, SLC22A4, SLC22A3, ATP2B1
transmembrane transport | ABCG5, ABCG8, SLC22A5, SLC22A4, SLC22A3, ATP2B1

GO Cluster #11: SMAD3, HDAC9, REST
GO Term | Genes
cytoplasm | SMAD3, HDAC9, REST
negative regulation of transcription from RNA polymerase II promoter | SMAD3, HDAC9, REST
nucleus | SMAD3, HDAC9, REST
protein binding | SMAD3, HDAC9, REST
transcription factor binding | SMAD3, HDAC9, REST
chromatin binding | HDAC9, REST
cytosol | HDAC9, REST
DNA binding | HDAC9, REST
histone H4 deacetylation | SMAD3, HDAC9
metal ion binding | SMAD3, HDAC9
negative regulation of cell proliferation | HDAC9, REST
negative regulation of transcription, DNA-dependent | SMAD3, HDAC9
nucleoplasm | SMAD3, REST
positive regulation of transcription, DNA-dependent | HDAC9, REST
regulation of transcription, DNA-dependent | HDAC9, REST
sequence-specific DNA binding transcription factor activity | HDAC9, REST
transcription factor complex | SMAD3, REST
transcription regulatory region DNA binding | HDAC9, REST
transcription, DNA-templated | SMAD3, REST

GO Cluster #12: RASD1, ARL3, FLT1, FES, KSR2, NOS3, SWAP70, BSND, SRR, SORT1, EDNRA, ADORA2A, FURIN, MC4R, CELSR2, PPAP2B, KCNE2, VAMP8
GO Term | Genes
protein binding | RASD1, ARL3, FLT1, FES, KSR2, NOS3, SWAP70, BSND, SRR, SORT1, EDNRA, ADORA2A, FURIN, MC4R, CELSR2, PPAP2B, KCNE2, VAMP8
plasma membrane | RASD1, ARL3, FLT1, FES, KSR2, SWAP70, BSND, SRR, SORT1, EDNRA, ADORA2A, FURIN, MC4R, CELSR2, PPAP2B, KCNE2, VAMP8
cytoplasm | ARL3, FLT1, FES, NOS3, SWAP70, SORT1, ADORA2A, CELSR2, PPAP2B, KCNE2
integral to membrane | RASD1, FES, KSR2, BSND, SRR, EDNRA, FURIN, CELSR2, VAMP8
membrane | RASD1, FES, KSR2, NOS3, SRR, SORT1, FURIN, MC4R, VAMP8
G-protein coupled receptor signaling pathway | RASD1, ARL3, FES, KSR2, EDNRA, MC4R, CELSR2
ATP binding | ARL3, NOS3, SWAP70, SORT1, PPAP2B, KCNE2
nucleus | ARL3, SWAP70, EDNRA, ADORA2A, MC4R, KCNE2
Golgi apparatus | ARL3, NOS3, SWAP70, FURIN, CELSR2
G-protein coupled receptor activity | RASD1, FES, KSR2, EDNRA
integral to plasma membrane | RASD1, FLT1, KSR2, SWAP70
aging | KSR2, SRR, PPAP2B
calcium ion binding | FES, PPAP2B, KCNE2
cell proliferation | KSR2, NOS3, BSND
cell surface | BSND, SRR, CELSR2
cytosol | NOS3, SORT1, ADORA2A
Golgi membrane | ARL3, BSND, ADORA2A
metal ion binding | ARL3, BSND, SORT1
neuronal cell body | RASD1, CELSR2, PPAP2B
neurotrophin TRK receptor signaling pathway | RASD1, BSND, CELSR2
protein kinase activity | NOS3, SWAP70, SORT1
protein phosphorylation | NOS3, SWAP70, SORT1
protein transport | ARL3, MC4R, VAMP8
protein tyrosine kinase activity | NOS3, SWAP70, SORT1
transferase activity, transferring phosphorus-containing groups | NOS3, SWAP70, SORT1


Table 12. GWAS-based candidate genes. The CardioGRAMC4D Consortium reported 50 loci implicated in risk of CAD and assigned these variants to one or more genes in the locus. Included is a summary table of the variants and assigned genes. Also included is any evidence identified by GTEx for an eQTL associated with expression of the assigned genes and the linkage disequilibrium structure between these eQTLs and the reported GWAS variants.

Locus | SNP | eQTL in GTEx (for SNP) | Nearby Genes | Minimum p-value in CATHGEN | Percent Significant Permutations in FHS | GTEx eQTLs (for gene) | eQTLs in LD with GWAS hit (r2 > 0.8) | Notes

1 rs3798220 None LPAL2 0.01 1% None n/a Other nearby genes: SLC22A3, LPA rs2048327 None 2 rs515135 None APOB 0.95 6% Adipose, No Heart Left Ventricle,

183 Skin Sun Exposed

3 rs1122608 None LDLR 0.39 28% None n/a 4 rs2075650 None APOE 0.20 2% None n/a Other nearby genes: APOC1 5 rs11206510 PCSK9 PCSK9 0.77 12% Nerve Tibial GWAS hit (Nerve Tibial) is eQTL in Tibial Nerve 6 rs10808546 None TRIB1 0.86 100% None No rs2954029 None 7 rs12413409 None CYP17A1 0.70 31% None n/a CNNM2 0.02 1% None n/a NT5C2 0.25 71% None n/a


Table 12. GWAS-based candidate genes continued 8 rs7692387 None GUCY1A3 0.00 3% Artery Tibial No Artery Tibial: rs6847136 (r2 = 0.001, d' = 1) 9 rs17514846 None FURIN 0.00 28% None n/a FES 0.03 97% Nerve Moderate Nerve Tibial: rs11638635 (r2 = Tibial, 0.656, d' = 1) Thyroid Thyroid: rs12906125 (r2 = 0.442, d' = 1) 10 rs17465637 None MIA3 0.03 1% None n/a 11 rs1746048 None CXCL12 0.56 14% Esophagus No GWAS hits not tightly linked (r2 = Mucosa 0.049, d' = 0.636)

rs2047009 None

184 12 rs6725887 NBEAL1 WDR12 0.33 34% None n/a (Artery Tibial), ICA1L (Thyroid)

13 rs12526453 None PHACTR1 0.04 88% Blood No Blood: rs12205644 (r2 = 0.085, d' = 0.524) 15 rs2306374 MRAS (Artery MRAS 0.02 3% Artery GWAS hit GWAS hits are in perfect LD (r2 = Aorta) Aorta, is eQTL in 1, d' = 1) Artery Tibial Aortic Artery rs9818870 MRAS (Artery Aorta)

16 rs17114036 None PPAP2B 0.40 19% None n/a 17 rs12190287 None TCF21 0.16 0% None n/a 18 rs11556924 None ZC3HC1 0.01 2% None n/a


19 rs1412444 LIPA (Blood, LIPA 0.06 21% Blood, GWAS hit GWAS hits not tightly linked (r2 = Lung) Lung, is eQTL in 0.392, d' = 0.788) Muscle Blood and Skeletal Lung

rs11203042 None 20 rs974819 None PDGFD 0.03 1% None n/a 21 rs4773144 None COL4A1 0.24 3% None n/a COL4A2 0.51 0% None n/a

185 22 rs3825807 None ADAMTS7 0.36 3% None n/a GWAS hits not tightly linked (r2 = 0.385, d' = 0.631) rs7173743 None 23 rs216172 None SMG6 0.82 45% None n/a Other nearby genes: SRR 24 rs12936587 None RASD1 0.09 1% None n/a RAI1 0.01 29% None n/a PEMT 0.01 1% Blood No Blood: rs7209003 (r2 = 0.075, d' = 0.475) 25 rs46522 UBE2Z UBE2Z 0.02 19% Blood, Skin GWAS hit Other nearby genes: GIP, ATP5G1, (Blood) Sun is eQTL in SNF8 Exposed Blood

26 rs4845625 TDRD10 IL6R 0.05 11% None n/a (Lung)

27 rs1878406 None EDNRA 0.92 3% None n/a 28 rs2023938 None HDAC9 0.06 5% None n/a


Table 12. GWAS-based candidate genes continued 29 rs1561198 GGCX GGCX 0.72 1% ArteryAorta, GWAS hit Other nearby genes: VAMP5 (Blood, Artery Artery is eQTL in Aorta, Artery Tibial, Blood, Tibial), Blood, Aortic VAMP8 Lung, Artery, and (Esophagus Nerve Tibial Mucosa) Tibial, Artery Thyroid, VAMP8 1.00 100% Esophagus GWAS hit Mucosa is eQTL in Esophagus

186 30 rs2252641 None ZEB2 0.32 76% None n/a Other nearby genes: AC074093.1

31 rs273909 None SLC22A5 0.10 5% Blood, No Other nearby genes: SLC22A4 Esophagus Blood: rs2631367 (r2 = 0.156, d' = Mucosa, 1) Lung, Skin Esophagus Mucosa: rs10065787 Sun (r2 = 0.108, d' = 1) Exposed Lung: rs6860806 (r2 = 0.220, d' = 1) Skin Sun Exposed: rs2631367 (r2 = 0.156, d' = 1) 32 rs4252120 None PLG 0.05 2% None n/a 33 rs264 None LPL 0.43 26% None n/a

35 rs9982601 None KCNE2 0.23 12% None n/a Other nearby genes: MRPS6 36 Other nearby genes: ZNF259, rs9326246 None APOA1 0.88 1% None n/a APOA5, APOA4, APOC3 rs964184 GWAS hits are moderately linked None (r2 = 0.63, d' = 1)


Table 13. Epistatic pairs. Percent significant (p < 0.05) permutations (out of 5000) in bootstrapped conditional regression for interaction terms between candidate SNPs.

#pair 1: VAMP8|PHACTR1
GWAS Hits | rs1010 | rs9369640 | rs1010*rs9369640 | 6% | 1% | 7%
eQTLs | rs1009 | rs7774863 | rs1009*rs7774863 | 20% | 9% | 47%
eQTLs | rs1009 | rs381134 | rs1009*rs381134 | 19% | 8% | 47%
eQTLs | rs1009 | rs2439538 | rs1009*rs2439538 | 14% | 8% | 39%
eQTLs | rs1010 | rs7774863 | rs1010*rs7774863 | 21% | 9% | 48%
eQTLs | rs1010 | rs381134 | rs1010*rs381134 | 20% | 8% | 47%
eQTLs | rs1010 | rs2439538 | rs1010*rs2439538 | 15% | 9% | 43%
eQTLs | rs6757263 | rs7774863 | rs6757263*rs7774863 | 21% | 10% | 55%
eQTLs | rs6757263 | rs381134 | rs6757263*rs381134 | 20% | 11% | 53%
eQTLs | rs6757263 | rs2439538 | rs6757263*rs2439538 | 15% | 11% | 45%

#pair 2: VAMP8|ZEB2
GWAS Hits | rs1010 | rs17514846 | rs1010*rs17514846 | 20% | 29% | 25%
eQTLs | rs1009 | rs2677737 | rs1009*rs2677737 | 1% | 18% | 1%
eQTLs | rs1010 | rs2677737 | rs1010*rs2677737 | 0% | 17% | 1%
eQTLs | rs6757263 | rs2677737 | rs6757263*rs2677737 | 0% | 21% | 1%

#pair 3: PHACTR1|FES
GWAS Hits | rs9369640 | rs17514846 | rs9369640*rs17514846 | 20% | 29% | 25%
eQTLs | rs7774863 | rs2677737 | rs7774863*rs2677737 | 0% | 0% | 0%
eQTLs | rs381134 | rs2677737 | rs381134*rs2677737 | 1% | 20% | 1%
eQTLs | rs2439538 | rs2677737 | rs2439538*rs2677737 | 2% | 7% | 1%

#pair 4: VAMP8|FES
GWAS Hits | rs1010 | rs17514846 | rs1010*rs17514846 | 20% | 29% | 25%
eQTLs | rs1009 | rs2677737 | rs1009*rs2677737 | 1% | 18% | 1%
eQTLs | rs1010 | rs2677737 | rs1010*rs2677737 | 0% | 17% | 1%
eQTLs | rs6757263 | rs2677737 | rs6757263*rs2677737 | 0% | 21% | 1%

#pair 5: MIA3|GUCY1A3
GWAS Hits | rs17465637 | rs7692387 | rs17465637*rs7692387 | 48% | 8% | 32%

Table 14. CHRNA5-enhancer locus. Most significant eQTLs reported by GTEx for nicotinic locus in each gene x tissue combination available. Bold if the most significant eQTL for the tissue is accounted for by the CHRNA5-enhancer locus (rs880395). The effect of this SNP is negative for all genes except PSMA4.

Gene | Tissue | p value | Effect Size
CHRNA3 | Muscle - Skeletal | 8.30E-40 | -0.79
CHRNA3 | Brain - Caudate (basal ganglia) | 1.10E-11 | -0.72
CHRNA3 | Brain - Nucleus accumbens (basal ganglia) | 4.80E-08 | -0.74
CHRNA3 | Brain - Putamen (basal ganglia) | 6.60E-08 | -0.60
CHRNA3 | Testis | 9.7E-06 | -0.50
CHRNA5 | Adipose | 5.40E-38 | -0.85
CHRNA5 | Artery - Aorta | 1.40E-07 | -0.37
CHRNA5 | Artery - Coronary | 1.60E-09 | -0.62
CHRNA5 | Artery - Tibial | 3.50E-21 | -0.63
CHRNA5 | Brain - Anterior cingulate cortex (BA24) | 3.00E-12 | -0.91
CHRNA5 | Brain - Caudate (basal ganglia) | 3.40E-12 | -0.79
CHRNA5 | Brain - Cerebellar Hemisphere | 1.40E-16 | -0.84
CHRNA5 | Brain - Cerebellum | 5.10E-14 | -0.87
CHRNA5 | Brain - Cortex | 4.00E-14 | -0.97
CHRNA5 | Brain - Frontal Cortex (BA9) | 6.50E-14 | -0.82
CHRNA5 | Brain - Hippocampus | 7.60E-13 | -0.82
CHRNA5 | Brain - Hypothalamus | 1.50E-09 | -0.67
CHRNA5 | Brain - Nucleus accumbens (basal ganglia) | 5.10E-10 | -0.75
CHRNA5 | Brain - Putamen (basal ganglia) | 4.80E-12 | -0.82
CHRNA5 | Breast - Mammary Tissue | 1.70E-14 | -0.62
CHRNA5 | Cells - Transformed fibroblasts | 7.50E-51 | -0.69
CHRNA5 | Esophagus - Mucosa | 5.10E-15 | -0.36
CHRNA5 | Heart - Atrial Appendage | 1.10E-19 | -0.85
CHRNA5 | Heart - Left Ventricle | 1.50E-30 | -0.81
CHRNA5 | Lung | 1.80E-18 | -0.6
CHRNA5 | Muscle - Skeletal | 4.90E-91 | -1.1
CHRNA5 | Nerve - Tibial | 2.30E-44 | -0.92
CHRNA5 | Skin | 4.00E-16 | -0.51
CHRNA5 | Thyroid | 1.70E-07 | -0.39
CHRNA5 | Whole Blood | 1.40E-07 | -0.37
PSMA4 | Adipose | 1.70E-06 | 0.17
PSMA4 | Artery - Aorta | 7.80E-06 | 0.16
PSMA4 | Artery - Tibial | 6E-06 | 0.15
PSMA4 | Heart - Left Ventricle | 1.10E-11 | 0.41
PSMA4 | Muscle - Skeletal | 2.30E-15 | 0.23
PSMA4 | Nerve - Tibial | 6.40E-08 | 0.19
PSMA4 | Skin | 2.60E-06 | 0.17
RP11-650L12.2 | Adipose | 3.30E-35 | -0.82
RP11-650L12.2 | Artery - Aorta | 2.70E-06 | -0.32
RP11-650L12.2 | Artery - Coronary | 3.90E-08 | -0.56
RP11-650L12.2 | Artery - Tibial | 1.50E-18 | -0.59
RP11-650L12.2 | Brain - Anterior cingulate cortex (BA24) | 8.60E-09 | -0.8
RP11-650L12.2 | Brain - Caudate (basal ganglia) | 1.70E-12 | -0.85
RP11-650L12.2 | Brain - Cerebellar Hemisphere | 8.50E-09 | -0.55
RP11-650L12.2 | Brain - Cortex | 8.40E-13 | -0.95
RP11-650L12.2 | Brain - Frontal Cortex (BA9) | 5.00E-13 | -0.8
RP11-650L12.2 | Brain - Hippocampus | 1.20E-10 | -0.79
RP11-650L12.2 | Brain - Hypothalamus | 2.60E-07 | -0.56
RP11-650L12.2 | Brain - Nucleus accumbens (basal ganglia) | 3.10E-11 | -0.79
RP11-650L12.2 | Brain - Putamen (basal ganglia) | 9.90E-12 | -0.88
RP11-650L12.2 | Breast - Mammary Tissue | 6.50E-14 | -0.6
RP11-650L12.2 | Cells - Transformed fibroblasts | 4.70E-52 | -0.69
RP11-650L12.2 | Esophagus - Mucosa | 2.20E-12 | -0.35
RP11-650L12.2 | Heart - Atrial Appendage | 4.90E-17 | -0.83
RP11-650L12.2 | Heart - Left Ventricle | 6.60E-27 | -0.81
RP11-650L12.2 | Lung | 2.00E-20 | -0.64
RP11-650L12.2 | Muscle - Skeletal | 5.60E-83 | -1
RP11-650L12.2 | Nerve - Tibial | 1.70E-40 | -0.9
RP11-650L12.2 | Skin | 2.40E-14 | -0.45
RP11-650L12.2 | Thyroid | 1.50E-08 | -0.42
RP11-650L12.2 | Whole Blood | 6.10E-07 | -0.35


Table 15. Evidence sources for candidate SNPs in nicotinic receptor locus.

Source | SNPs
(1) Top eQTLs | rs4886580, rs4887074, rs880395, rs7164030, rs1979905, rs1979906, rs1979907, rs905740, rs4887064, rs12907966, rs1948
(2) Multi-gene eQTLs | rs12915669, rs28661610, rs2869552, rs2869553, rs62010552
(3) eQTL/GWAS overlap | rs8040868, rs1814880, rs11852909, rs2017091, rs61390267, rs7176070, rs12899940, rs12907511, rs11632672, rs7182993, rs7182694, rs7181486, rs17405217, rs17483548, rs55781567, rs55853698, rs8040868, rs1051730, rs2036527, rs6495309, rs660652, rs55781567, rs55853698, rs55958997, rs11858836, rs8034191, rs17486278, rs1051730, rs12914385, rs8042374, rs8040868, rs11858836, rs8034191, rs2036527, rs17486278, rs1051730, rs12914385, rs55958997, rs8042374, rs667282, rs899997
(4) Allelic ratios | rs11633223, rs8040868, rs55781567, rs55853698, rs1948, rs16969968, rs1051730, rs3743075, rs8040868, rs8192475, rs386616281
(5) Popular in the literature | rs16969968, rs6495309, rs660652, rs55781567, rs55853698, rs8040868, rs1051730, rs2036527, rs16969968, rs880395, rs7164030, rs1979905, rs1979906, rs1979907, rs905740, rs588765, rs578776, rs12914385


Table 16. SNPs marking functional haplotypes and minor allele frequencies (MAFs) from different 1000 Genomes populations.

Population | rs880395 (CHRNA5-enhancer, G/A) | rs16969968 (CHRNA5 nonsynonymous, G/A) | rs1948 (CHRNA3 brain eQTL, G/A)
European | 38% | 37% | 32%
African | 16% | 2% | 22%
East Asian | 15% | 3% | 48%
American | 20% | 21% | 15%
South Asian | 36% | 18% | 33%


Appendix G: Introduction of network construction algorithm

Model

We assume that the joint probability mass function of the expression of m transcripts $X_1, \ldots, X_m$ may be represented by individual expression levels $X_i$, $i = 1, 2, \ldots, m$, and pairwise interactions (co-expressions) $(X_i, X_j)$, $i \neq j$, $i, j = 1, 2, \ldots, m$. This is usually stated in terms of the Gibbs measure (Kindermann & Snell, 1980), or the Hamiltonian of the system, as follows:

\[
P(X_1 = x_1, \ldots, X_m = x_m) = \frac{1}{Z} \exp\left( -\sum_{i} \phi_i(x_i) - \sum_{i \neq j} \phi_{ij}(x_i, x_j) \right)
\]

where Z is the normalization constant and $\phi_i$, $\phi_{ij}$, $i \neq j$, $i, j = 1, 2, \ldots, m$, are potential functions. Such assumptions have been exploited in many methods for building genetic co-expression networks (Margolin et al., 2006) and also in the theory of Markov random fields (Kindermann & Snell, 1980).
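As an illustration of the Gibbs form described above, the following minimal Python sketch enumerates all configurations of a small discrete system and normalizes their weights. The function name `gibbs_pmf`, the data layout, and the toy potentials are assumptions made for this example only; each pairwise potential is stored once per unordered pair of transcripts, and exhaustive enumeration is only feasible for very small m.

```python
import itertools
import math


def gibbs_pmf(states, unary, pairwise):
    """Joint pmf of a model with first- and second-order potentials only.

    states   : values each variable may take (assumed identical for all variables)
    unary    : list of dicts, unary[i][x] is the potential of variable i at value x
    pairwise : dict mapping (i, j) with i < j to a function of (x_i, x_j)
    """
    m = len(unary)
    weights = {}
    for config in itertools.product(states, repeat=m):
        # Energy = sum of single-variable potentials + sum of pairwise potentials.
        energy = sum(unary[i][config[i]] for i in range(m))
        energy += sum(phi(config[i], config[j]) for (i, j), phi in pairwise.items())
        weights[config] = math.exp(-energy)
    z = sum(weights.values())  # normalization constant Z
    return {config: w / z for config, w in weights.items()}


# Hypothetical toy system: two binary transcripts with no single-variable
# preference and a pairwise potential that favors equal expression states.
unary = [{0: 0.0, 1: 0.0}, {0: 0.0, 1: 0.0}]
pairwise = {(0, 1): lambda a, b: -1.0 if a == b else 0.0}
pmf = gibbs_pmf([0, 1], unary, pairwise)
```

In this toy system the concordant configurations (0, 0) and (1, 1) receive more probability mass than the discordant ones, which is the co-expression signal that the pairwise terms are meant to capture.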

Measure of co-expression

To quantify co-expression of transcripts we use the generalized one-parameter family of Renyi divergence measures (van Erven & Harremoes, 2014):

\[
D_\alpha(P \parallel Q) = \frac{1}{\alpha - 1} \ln \sum_{i=1}^{n} p_i^{\alpha} q_i^{1-\alpha}
\]

where $P = (p_1, \ldots, p_n)$ and $Q = (q_1, \ldots, q_n)$ are probability vectors. For discrete distributions, the Renyi divergence shares a wide variety of properties with the standard Kullback-Leibler divergence (van Erven & Harremoes, 2014) and the chi-square distance. We quantify the strength of pairwise interactions (RNA co-expression) by measuring the ‘distance’ (Renyi divergence) between the (marginalized) bivariate distribution of $(X_i, X_j)$, $i \neq j$, $i, j = 1, \ldots, m$, and the product of its marginals:

\[
I_\alpha(X_i, X_j) := D_\alpha\left( P_{X_i, X_j} \parallel P_{X_i} P_{X_j} \right)
= \frac{1}{\alpha - 1} \ln \sum_{x_i, x_j} P(X_i = x_i, X_j = x_j)^{\alpha} \left[ P(X_i = x_i) \, P(X_j = x_j) \right]^{1-\alpha}
\]

Following the standard convention for α = 1, where the Kullback-Leibler divergence between the bivariate distribution and the product of its marginals is called ‘Mutual Information’, we call the above-defined functionals ‘Renyi mutual information’.

It is important to note that the Renyi mutual information equals zero if and only if the bivariate distribution equals the product of its marginals (i.e. the expression levels of the two RNAs are independent).
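The following Python sketch shows one way the Renyi mutual information of two discretized expression profiles could be computed from a two-dimensional contingency table of counts. It is a simplified illustration under stated assumptions, not the implementation used in this work: the function name `renyi_mutual_information` is hypothetical, probabilities are plain empirical frequencies, and α = 1 is handled as the Kullback-Leibler limit.

```python
import math


def renyi_mutual_information(table, alpha):
    """Renyi divergence between an empirical joint distribution and the
    product of its marginals, computed from a 2-D table of counts."""
    if alpha <= 0:
        raise ValueError("alpha must be positive")
    total = float(sum(sum(row) for row in table))
    joint = [[count / total for count in row] for row in table]
    row_marginal = [sum(row) for row in joint]
    col_marginal = [sum(col) for col in zip(*joint)]

    kl_limit = math.isclose(alpha, 1.0)  # alpha = 1: standard mutual information
    acc = 0.0
    for a, row in enumerate(joint):
        for b, p in enumerate(row):
            if p == 0.0:
                continue  # empty bins contribute nothing for alpha > 0
            q = row_marginal[a] * col_marginal[b]
            if kl_limit:
                acc += p * math.log(p / q)
            else:
                acc += p ** alpha * q ** (1.0 - alpha)
    return acc if kl_limit else math.log(acc) / (alpha - 1.0)


# Hypothetical count tables: the first factorizes exactly into its marginals,
# the second has excess mass on the diagonal (positive dependence).
independent_table = [[10, 20], [20, 40]]
dependent_table = [[40, 10], [10, 40]]
```

For the independent table the value is zero up to floating-point error for any α, matching the independence property stated above; for the dependent table it is positive and grows with α.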

The parameter α in the Renyi mutual information allows for up- or down-weighting of negative or positive dependencies in the two-dimensional contingency table corresponding to the empirical bivariate distribution of $(X_i, X_j)$ [5–7].

Here, we aimed to investigate context-dependent co-expression patterns (the context being varying genetic and environmental backgrounds). Therefore, in addition to the resampling procedure (see Figure 13), we investigated the impact of rare versus frequent bins in the contingency table on co-expression by changing the free parameter α. We selected several values of this parameter (ranging from 0.1 to 3) and observed that the degree distribution of the graph remained unchanged, whereas the stability of individual co-expressed pairs was low. It remains to be determined whether, for specific small-scale systems of RNAs, a higher (or lower) stability of interactions may be achieved via the choice of the α parameter. Nevertheless, we find that the impact of frequent versus rare co-expressions may be changed by up- or down-weighting positive and negative dependencies using the free parameter (Douglas, Fienberg, Lee, Sampson, & Whitaker, 1990).

Data Processing Inequality

Note that we assume that the probability mass function may be written in the Gibbs form with only first and second order interactions. In such a setting, Margolin et al. used the standard Mutual Information as a measure of co-expression (estimated via the Gaussian kernel method) and employed the Data Processing Inequality for pruning indirect interactions as part of the ARACNE algorithm for building co-expression networks (Margolin et al., 2006). We note that the Data Processing Inequality holds for the generalized Renyi divergence measures as well (see also Theorem 1 and Theorem 9 in van Erven & Harremoes, 2014). Accordingly, we also employ the Data Processing Inequality to prune ‘indirect’ edges from the full graph. At each repetition of the resampling procedure, our algorithm therefore considers all triangles in the full graph and removes the ‘weakest’ edge (i.e. the co-expression with the lowest value) of each triangle, since, in the simplest Markov-type model, the information represented by the weakest edge in a triangle is explained by the two stronger ones (i.e. the weakest interaction is indirect).
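The triangle-pruning step can be sketched in Python as follows. This is an illustrative reading of the description above rather than the code used in this work: the function name `prune_indirect_edges` and the dictionary layout are hypothetical, all triangles are evaluated against the full graph before any edge is deleted (an assumption that makes the result independent of processing order), and ties are broken arbitrarily.

```python
import itertools


def prune_indirect_edges(weights):
    """Remove the weakest edge of every triangle in an undirected graph.

    weights: dict mapping frozenset({u, v}) to a co-expression value
             (for example a Renyi mutual information estimate).
    Returns a new dict containing only the retained edges.
    """
    nodes = sorted({node for edge in weights for node in edge})
    to_remove = set()
    for a, b, c in itertools.combinations(nodes, 3):
        triangle = [frozenset((a, b)), frozenset((a, c)), frozenset((b, c))]
        # Only fully connected triples form a triangle in the full graph.
        if all(edge in weights for edge in triangle):
            to_remove.add(min(triangle, key=weights.__getitem__))
    return {edge: w for edge, w in weights.items() if edge not in to_remove}


# Hypothetical example: A-B and B-C are strong, A-C is weak and is treated as
# indirect; C-D is not part of any triangle and is kept.
example_weights = {
    frozenset(("A", "B")): 0.9,
    frozenset(("B", "C")): 0.8,
    frozenset(("A", "C")): 0.3,
    frozenset(("C", "D")): 0.5,
}
pruned = prune_indirect_edges(example_weights)
```

Checking every node triple costs on the order of m^3 operations, so for large networks one would restrict the search to triples of nodes that are actually connected.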


Appendix H: GTEx Tissues Color Key
