The Pennsylvania State University

The Graduate School

Eberly College of Science

BEYOND GENOME-WIDE ASSOCIATION STUDIES (GWAS):

EMERGING METHODS FOR INVESTIGATING COMPLEX ASSOCIATIONS FOR

COMMON TRAITS

A Dissertation in

Biochemistry, Microbiology, and Molecular Biology

by

Molly A. Hall

© 2015 Molly A. Hall

Submitted in Partial Fulfillment

of the Requirements

for the Degree of

Doctor of Philosophy

December 2015

The dissertation of Molly A. Hall was reviewed and approved* by the following:

Marylyn D. Ritchie

Paul Berg Professor of Biochemistry and Molecular Biology

Dissertation Adviser

Chair of Committee

Scott B. Selleck

Professor of Biochemistry and Molecular Biology

Head of the Department of Biochemistry and Molecular Biology

Ross Hardison

T. Ming Chu Professor of Biochemistry and Molecular Biology

Santhosh Girirajan

Assistant Professor of Biochemistry and Molecular Biology

Assistant Professor of Anthropology

George H. Perry

Assistant Professor of Anthropology and Biology

Catherine A. McCarty

Principal Research Scientist, Essentia Institute of Rural Health

Special Member

*Signatures are on file in the Graduate School.

ABSTRACT

Genome-wide association studies (GWAS) have identified numerous loci associated with human phenotypes. This approach, however, does not consider the richly diverse and complex environment with which humans interact throughout the life course, nor does it allow for interrelationships among genetic loci and across traits. Methods that embrace pleiotropy (the effect of one gene on more than one trait), gene-environment (GxE) and gene-gene (GxG) interactions will further unveil the impact of alterations in biological pathways and identify genes that are only involved with disease in the context of the environment. This valuable information can be used to assess personal risk and choose the most appropriate medical interventions based on an individual’s genotype and environment. Additionally, a richer picture of the genetic and environmental aspects that impact complex disease will inform environmental regulations to protect vulnerable populations. Three key limitations of GWAS lead to an inability to robustly model trait prediction in a manner that reflects biological complexity: 1) GWAS explore traits in isolation, one phenotype at a time, preventing investigators from uncovering relationships that exist among multiple traits; 2) GWAS do not account for the exposome; rather, they simply explore the effect of genetic loci on an outcome; and 3) GWAS do not allow for interactions between genetic loci, despite the complexity that exists in biology. The aims described in this dissertation address these limitations. Methods employed in each aim have the potential to: uncover genetic interactions, unveil complex biology behind phenotype networks, inform public policy decisions concerning environmental exposures, and ultimately assess individual disease risk.

iii TABLE OF CONTENTS

List of Figures…………………………………………………………………………...... …vii

List of Tables………………………………………………………………………………….…..ix

Acknowledgements………………………………………………………………………………...x

Chapter 1. INTRODUCTION……………………………………………………………………...1

Background………………………………………………………………………………………...2

Genome-Wide Association Studies…...…………………………………………………………...3

Missing Heritability………………………………………………………………………………..6

Interrelationships Across the Phenome…………………………………………………………….8

The Exposome and Gene-Environment Interactions……………………………………………..11

Genetic Interactions………………………………………………………………………………14

Impact…………………………………………………………………………………………….18

Chapter 2. UNVEILING INTERRELATIONSHIPS ACROSS PHENOTYPES USING PHENOME-WIDE ASSOCIATION STUDIES (PHEWAS)* ………………………………….19

Abstract…………………………………………………………………………………………...20

Introduction……………………………………………………………………………………….21

Methods…………………………………………………………………………………………...23

Results…………………………………………………………………………………………….27

Discussion………………………………………………………………………………………...53

Chapter 3. INVESTIGATING THE EXPOSOME AND GENE-ENVIRONMENT INTERACTIONS USING ENVIRONMENT-WIDE ASSOCIATION STUDIES (EWAS)*………………………………………………………………………………………….60

Abstract…………………………………………………………………………………………...61

Introduction…………………………………………………………………………………….…62

Methods…………………………………………………………………………………………...64

Results…………………………………………………………………………………………….70

Discussion……………………………………………………………………………………...…74

Chapter 4. KNOWLEDGE-DRIVEN METHOD FOR ASSESSING GENETIC INTERACTIONS* ………………………………………………………………………………78

Abstract…………………………………………………………………………………………...79

Introduction……………………………………………………………………………………….80

Methods…………………………………………………………………………………………...82

Results…………………………………………………………………………………………….87

Discussion………………………………………………………………………………………...90

Chapter 5. DATA-DRIVEN WEIGHTED ENCODING: A ROBUST APPROACH FOR DETECTING DIVERSE GENETIC ACTION…………………………………………………..95

Abstract…………………………………………………………………………………………...96

Introduction……………………………………………………………………………………….97

Methods…………………………………………………………………………………………...99

Results…………………………………………………………………………………………...108

Discussion……………………………………………………………………………………….118

Chapter 6. CONCLUSIONS…………………………………………………………………….122

References……………………………………………………………………………………….131

Appendix………………………………………………………………………………………...147

* Portions of the chapter are from published manuscripts for which Molly Hall is the first author

LIST OF FIGURES

Chapter 1

Figure 1.1. The most recent GWAS diagram displaying all SNP-trait associations..……………………….5

Figure 1.2. The number of associations with a p-value < 5×10⁻⁸ curated in the GWAS Catalog.…………..7

Chapter 2

Figure 2.1. Overview of the approach for this study.………………………………………………………29

Figure 2.2. Replicating results for PheWAS.……………………………………………………………….32

Figure 2.3. Related results for PheWAS.…………………………………………………………………...34

Figure 2.4. Potentially pleiotropic results..…………………………………………………………………45

Figure 2.5. Sun plot of (p<0.01) results for ABCG2 rs2231142, coded allele C.……………………………47

Figure 2.6. Sun plot of (p<0.01) results for KCTD10 rs2338104, coded allele G. .………………………..48

Figure 2.7. Sun plot of (p<0.01) results for LIPC rs1800588, coded allele T..…………………………….49

Figure 2.8. Using PheWAS results, Biofilter, and Cytoscape to explore gene-gene connections with NetPath..…………………………………………………………………………………………………….51

Figure 2.9. Using PheWAS results, Biofilter, and Cytoscape to explore gene-gene connections with GO biological processes..………………………………………………………… .……………………………52

Figure 2.10. Using PheWAS results, Biofilter, and Cytoscape to explore gene-gene connections with KEGG connections.………………………………………………………… .……………………………..53

Chapter 3

Figure 3.1. The most significant association results in the Marshfield sample.……………………………70

Figure 3.2. Replicating results of the most significant Marshfield EWAS associations from NHANES III and NHANES 1999-2002.……………………………………………………….………………………….72

Figure 3.3. Manhattan plot of SNPs interacting with Alcohol 30 Day Frequency at a LRT p-value < 1×10⁻⁴.…………………………………………………………………………………………….74

Chapter 4

Figure 4.1. Steps involved in generating Biofilter SNP-SNP models...……………………………………86

Figure 4.2. Flow chart of steps in the discovery and replication analyses.………………………………...87

Figure 4.3. All replicating SNP-SNP models with LRT p < 0.01 in both the replication and discovery datasets.……………………………………………………………………………………………………...88

Figure 4.4. Ten most significant replicating SNP-SNP models.………………………………………...…89


Figure 4.5. Common groups relating to genes in replicating SNP-SNP models.…………………………..90

Chapter 5

Figure 5.1. Equations used to assign a data-driven weighted value for the heterozygous genotype……...100

Figure 5.2. Correlations between main effect results obtained using additive, dominant, recessive, and codominant encodings.…………………………………….………………………………………………109

Figure 5.3. Correlations between interaction results obtained using additive, dominant, recessive, and codominant encodings.…………………………………………………………………………………….109

Figure 5.4. Distribution of the estimated heterozygous action (α).……………………………………….110

Figure 5.5. Power plot for each interaction model.……………………………………………………….111

Figure 5.6. ANOVA results from parameter sweep.……………………………………………………...112

Figure 5.7. Average LRT p-value for interaction models at a standardized signal to noise ratio.………..113

Figure 5.8. Average power for standardized signal to noise ratio across all traditional interaction models……………………………………………………………………………………………………...115

Figure 5.9. Average power for standardized signal to noise ratio across all genotype-based interaction models.……………………………………………………………………...……………………………...116

Figure 5.10. Type 1 error in simulated main effect only and null data.……………………………….…117

Figure 5.11. Average and maximum type 1 error………………………………………………………....118

Chapter 6

Figure 6.1. Integration of PheWAS and EWAS to uncover gene-environment interplay underlying pleiotropy.………………………………………………………………………………………………….126

Figure 6.2. Three heterogeneous mechanisms associated with age-related cataract.……………………..128

Figure 6.3. Use of BioBin to combine variants across genes.……………………………………………129

Appendix

Appendix Figure 5.1. Impact of MAF on alpha value.…………………………………………………...182

Appendix Figure 5.2. Expected null distribution of the alpha value by minor allele frequency.……...…182

Appendix Figure 5.3. Power plots at 10% and 50% minor allele frequency.…………………………….183

LIST OF TABLES

Chapter 2

Table 2.1. Study population characteristics.………………………………………………………………..28

Table 2.2. Phenotype-classes.………………………………………………………………………………30

Table 2.3. Novel Results.………………………………………………………………………….………..38

Table 2.4. Pleiotropic results.…………………………………………………………….…………………46

Chapter 3

Table 3.1. Marshfield type 2 diabetes case/control sample size for each questionnaire.………………65

Table 3.2. EWAS variable classes, specific PhenX Toolkit questions, and the EWAS Marshfield PMRP results.……………………………………………………………………………………………………….71

Chapter 5

Table 5.1. Examples of possible proportional genotype risk underlying genetic loci.…………………….98

Table 5.2. Recessive, additive, dominant, and weighted encoding schemes.…………………………….100

Appendix

Appendix Table 2.1. 80 SNPs chosen for NHANES PheWAS.…………………………………………148

Appendix Table 2.2. PheWAS replication of associations from the GWAS Catalog.…………………...153

Appendix Table 2.3. PheWAS associations related to known results in the GWAS Catalog.…………...172

Appendix Table 2.4. Evaluation of differences in allele frequency for PheWAS-significant SNPs.…….177

Appendix Table 4.1. All 83 replicating SNP-SNP models.………………………………………………179

ACKNOWLEDGEMENTS

I am fortunate to have been surrounded by kind, encouraging, and impressive people during graduate school and am grateful for each individual listed here. You have all inspired me. First, I would like to express my deepest gratitude to my doctoral advisor, Dr. Marylyn Ritchie. You have been, in every sense of the word, my mentor: teacher, role model, and supporter. Thank you for being the scientist that I aspire to be. I am honored to have taken this journey with you, and I look forward to being your lifelong student. Dr. Ritchie has assembled the most impressive and motivated group of individuals I have ever worked with. Each member of the Ritchie Lab is dedicated to developing cutting-edge, sophisticated tools to reveal the biological underpinnings of complex disease. This group is solution-focused, collaborative, and supportive. Each lab member has made a lasting impact on my career. Special thanks are in order for certain members, as they have greatly contributed to the work described in this dissertation. Dr. Sarah Pendergrass and Anurag Verma contributed substantially to the PheWAS described in Chapter 2. Thank you to Dr. Pendergrass for mentoring me during my first bioinformatics project and for helping me stay the course. Thank you to Anurag for your patience and collaboration in the early years. Scott Dudek’s efforts – even from afar - were instrumental in the EWAS described in Chapter 3. Alex Frase, a powerhouse programmer, developed Biofilter software, which is described in Chapter 4. Shefali Setia Verma has made a major impact on my experience as a graduate student: she contributed greatly to the analyses described in Chapters 4 and 5, taught me how to analyze genetic data, regularly joined me for stimulating conversations over coffee, and has been an incredible friend. John Wallace made significant contributions to the encoding project described in Chapter 5. 
The data-driven weighted method is John’s brainchild, an impressive biological encoding tool from a self-described “non-biologist”. I have greatly enjoyed exploring what this tool is capable of. John has provided immense support over the last four years: from resolving program bugs to explaining statistics. John, you are an incredible teacher and a brilliant scientist. Thank you also to Anastasia Lucas, who contributed to research covered in Chapters 3 and 4, and to Yuki Bradford for her efforts toward the work described in Chapter 5. Additionally, I would like to thank Dr. Dokyoon Kim who has been a supportive mentor and friend to me. Finally, I would like to thank Suzy Unger and Donna McMinn for all of their administrative support over the years. I would also like to thank my committee members, who have each provided essential support and feedback – from research to career decisions – with diverse perspectives. Dr. Scott Selleck, Dr. Ross Hardison, Dr. Santhosh Girirajan, Dr. George Perry, and Dr. Catherine McCarty, thank you for serving on my committee. A special thank you to Dr. McCarty, who has supported me from Minnesota – thank you for your kindness and mentorship (and collaboration with Essentia/Marshfield PMRP) over the last four years. Thank you also to my wonderful cohort (Claire Reynolds, Jennifer Thweatt, Dr. Kari Organtini, Mohammad Almishwat, Doug Baumann, Sang Ah Kim, Han Zheng, Melanie McReynolds, Julia Zhu, Djoshkun Shengjuler, Nate Girer, and Sadikshya Adhikary). You have all been an important part of the graduate school experience, especially in the first year and a half. I have never before been with a more positive, inclusive, and intelligent group of students. To my friends and family who have provided strength throughout this process, thank you. Special thanks to Katherine O’Connor, Ruth Ostler, Allison Douglas, Lisa Greene, Margaret Hall, Dr. Richard Hall, Lauren Hall, and Julia Hall. Finally, thank you to my husband, Gordon Hall. 
We got married and then a week later started graduate school together. Two of the best decisions I’ve ever made. I am grateful for your unwavering support and friendship. The work you do, and the person you are, inspires me. I have enjoyed this PhD journey with you and am excited for the next chapter together.


When we try to pick out anything by itself, we find it hitched to everything else in the universe.

- John Muir

CHAPTER 1

INTRODUCTION

Background

Before the advent of genome-wide association studies (GWAS), the reigning methods for assessing the genetic etiology of human traits included candidate gene studies and linkage analyses. Linkage studies demonstrated considerable success at explaining the genetic component of rare, Mendelian disorders like cystic fibrosis and Huntington’s disease, which involve one or a small set of rare but highly penetrant mutations. For example, linkage studies identified mutations in cystic fibrosis transmembrane conductance regulator (CFTR) that lead to cystic fibrosis1–3.

While these rare diseases were largely explained by linkage studies, common traits like type 2 diabetes, height, body mass index (BMI), and many cancers remained elusive, due to their complex underpinnings.

Genetic association studies offered an alternative to linkage analysis with many advantages, including the fact that, unlike linkage studies, they do not require the recruitment of families, which allowed for studies with larger sample sizes. Generally, genetic association studies involve the comparison of allele frequencies for a given locus between those who have a trait of interest (e.g., type 2 diabetes cases) and those who do not (e.g., type 2 diabetes controls).

For instance, rs380390, an intronic single nucleotide polymorphism (SNP) in complement factor H (CFH), has two alleles, guanine (G) and cytosine (C), and thus three possible genotypes: GG, GC, and CC. Klein et al. (2005) found this SNP to be associated with age-related macular degeneration (AMD) (p=4×10⁻⁸, OR=4.6): the C allele was identified in a significantly larger proportion of those with AMD than those without and is, therefore, referred to as the “risk” allele4.
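The case/control comparison of allele frequencies described above can be sketched in a few lines. This is a minimal illustration only: the allele counts are hypothetical (they are not the counts reported by Klein et al.), and the function names `allele_frequency` and `allelic_odds_ratio` are introduced here for the example.

```python
def allele_frequency(risk_count, other_count):
    """Fraction of observed alleles that are the risk allele."""
    return risk_count / (risk_count + other_count)


def allelic_odds_ratio(risk_cases, other_cases, risk_controls, other_controls):
    """Odds of the risk allele in cases divided by the odds in controls."""
    return (risk_cases * other_controls) / (other_cases * risk_controls)


# Hypothetical allele counts (two alleles per person), for illustration only.
cases = {"C": 120, "G": 80}      # 100 cases
controls = {"C": 60, "G": 140}   # 100 controls

freq_cases = allele_frequency(cases["C"], cases["G"])           # 0.6
freq_controls = allele_frequency(controls["C"], controls["G"])  # 0.3
odds_ratio = allelic_odds_ratio(cases["C"], cases["G"],
                                controls["C"], controls["G"])   # 3.5

print(f"C allele frequency: cases={freq_cases:.2f}, controls={freq_controls:.2f}")
print(f"Allelic odds ratio for C: {odds_ratio:.2f}")
```

An odds ratio above 1 indicates that the allele is over-represented among cases; whether that difference is statistically significant is a separate question that depends on sample size and the number of tests performed.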

Early on, genetic association studies focused on candidate genes, and these, like linkage studies, showed little success in explaining the heritability of complex traits. Using this approach, researchers were restricted to assessing only genetic loci that were expected to play a part in the trait, which prevented the identification of novel risk loci. At the time, sequencing and extensive genotyping methods were expensive, while genotyping fewer, select loci was more affordable; thus, candidate studies were the practical choice. Fortunately, efforts from the Human Genome Project5 and the International HapMap Project6 made it possible to glean genetic information from across the genome without sequencing. This was done by genotyping a set of SNPs that provided information about loci in high linkage disequilibrium with the selected SNPs6, and this allowed a shift toward genome-wide association studies.

Genome-Wide Association Studies

With the knowledge gained from the HapMap project, researchers could genotype only a well-selected, relatively small number of SNPs across the genome: approximately 500,000 (as opposed to the ~3 billion base pairs in the human genome), and this was more cost effective than sequencing. Genome-wide genotyping platforms were designed to assay targeted SNPs as markers for their respective regions of the genome. Instead of restricting the search to selected candidate genes, SNPs across the genome were considered as putative risk loci, regardless of any known function.

The “common disease/common variant” (CDCV) hypothesis reigned at the time of GWAS development, and it underlies the assumptions of the method. CDCV assumes that, because complex diseases are common in the population, the genetic loci responsible are likely to be common as well7. The extreme alternative to CDCV is the “heterogeneity hypothesis”, which presumes that genetic variations at different loci of low frequencies (less than 1% of the population) in different individuals contribute, in combination and independently, to disease risk (also known as the “multiple rare-variant hypothesis”)8,9. As has been previously argued by Wang (2005), the genetic etiology of complex traits is likely to lie between these two contrasting theories, and further, may involve combinations of their components. Nevertheless, GWAS design caters to the CDCV hypothesis, as it is powered to detect signal in common variants. For instance, let us imagine that the true underlying genetic mechanism for a complex disease lies in the signal across multiple, distinct loci. We will also assume we have a study with 1,000 cases and 1,000 controls. If these loci are relatively common in the population, even if added risk for each of the variants is low (say, an odds ratio less than 2 each), the signals will be detected if there is a significant difference in allele frequencies between cases and controls when correcting for multiple, genome-wide tests. This scenario presents a problem, however, if each of the variants is rare in the population (if only a handful of individuals carry the risk allele). If, for example, only 3 samples carry one of the risk variants, this signal will likely go undetected, even if all three are cases, because it does not explain the other 997 cases. We can, therefore, imagine a scenario under which small groups of cases each carry different variants at distinct loci, whose signal would go undetected using association methods. Since it was expected that common variants were where the heritability of common traits would lie, GWAS genotyping platforms and analyses cater to common SNPs and miss variants of low frequency.
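The thought experiment above can be made concrete with a one-sided Fisher exact test on carrier counts. The sketch below is a simplification under stated assumptions: it counts carriers rather than alleles, the common-variant counts (400 carrier cases versus 250 carrier controls) are hypothetical, and the function name `enrichment_p_value` is introduced here for illustration. It uses only the Python standard library.

```python
from math import comb

GENOME_WIDE_ALPHA = 5e-8  # conventional genome-wide significance threshold


def enrichment_p_value(carrier_cases, n_cases, carrier_controls, n_controls):
    """One-sided Fisher exact p-value.

    Probability, under the hypergeometric null of no association, of seeing
    at least `carrier_cases` carriers among the cases given the total number
    of carriers in the study.
    """
    carriers = carrier_cases + carrier_controls
    total = n_cases + n_controls
    upper = min(carriers, n_cases)
    favourable = sum(
        comb(n_cases, k) * comb(n_controls, carriers - k)
        for k in range(carrier_cases, upper + 1)
    )
    return favourable / comb(total, carriers)


# Rare variant: only 3 carriers in the whole study, all of them cases.
p_rare = enrichment_p_value(3, 1000, 0, 1000)

# Common variant (hypothetical counts): 400 carrier cases, 250 carrier controls.
p_common = enrichment_p_value(400, 1000, 250, 1000)

print(f"rare variant:   p = {p_rare:.3f}, significant: {p_rare < GENOME_WIDE_ALPHA}")
print(f"common variant: p = {p_common:.2e}, significant: {p_common < GENOME_WIDE_ALPHA}")
```

Even in the most favorable configuration, with every carrier a case, the rare variant yields a p-value of roughly 0.12, which is many orders of magnitude from genome-wide significance. The common variant with a modest difference in carrier frequency passes the threshold comfortably.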

This was not considered a problem at first because early GWAS showed promise of the method’s ability to detect risk alleles of high effect. One of the first GWAS was for age-related macular degeneration by Klein et al. (2005), which described the association of the aforementioned SNP, rs380390, with a relatively high odds ratio of 4.6, meaning that presence of the C allele versus G yields an over four-fold increase in AMD risk4. Since 2005, GWAS has become a mainstay in human genetics research with 2,267 published studies curated in the NHGRI-EBI Catalog of published genome-wide association studies (GWAS Catalog)10,11 as of August 2015. Likewise, over 15,020 SNPs have been identified that are associated with one or more trait(s) (Figure 1.1).


Figure 1.1. The most recent GWAS diagram displaying all SNP-trait associations with p-values ≤ 5×10⁻⁸. Loci are organized by chromosome. The legend on the right shows the corresponding color of the phenotype class for each association12.

A number of these variants have demonstrated a strong effect size on the phenotype with which they are associated. An oft-cited example is age-related macular degeneration, for which 50% of the heritability was explained by five identified variants by 200913. As another example, GWAS efforts have successfully identified over 50 loci associated with type 1 diabetes, which together are thought to explain 67-75% of the familial risk for this disease14. These GWAS findings have led to a richer understanding about the physiological mechanisms at play in the development of the two conditions15. GWAS have also identified genes with clinical relevance, including promising clinical trials for a drug for an autoimmune disorder, ankylosing spondylitis16, and variation in SLCO1B1 associated with LDL cholesterol response to statin treatment17.

To date, GWAS have seen far more success in identifying disease-risk loci than linkage and candidate association studies14. One of the major advantages of this method is the ability to recruit large sample sizes without the requirement of using related samples. GWAS also allow for genetically agnostic assessments of disease, which has led to novel gene discoveries. The compilation of GWAS results in the GWAS Catalog further enabled researchers to identify functional categories of regions that are enriched in GWAS results18. Perhaps the greatest benefit GWAS offer, as described by Hirschhorn (2009), is novel pathway discovery, which can be used as a stepping-stone for subsequent complexity analyses18.

Missing Heritability

Genome-wide association studies have been successful at identifying numerous loci across the genome that are associated with a diverse and wide range of phenotypes. However, as predicted by Wang (2005), the distribution of the effect sizes of these associations demonstrates that a low proportion of SNPs have a high effect on the traits with which they are associated, while the majority of SNPs have a low effect size19 (Figure 1.2). The odds ratio cutoff separating high from low impact loci is somewhat arbitrary; however, Manolio et al. (2009) offer a 2-fold risk increase as a relaxed definition of a variant exhibiting a large effect13. Yet, for the majority of associations, the odds ratio is less than 1.5 (Figure 1.2).

[Figure 1.2: bar chart. Y-axis: Number of Associations (0 to 1,000). X-axis: Odds Ratio (1.1 to 160).]

Figure 1.2. The number of associations with a p-value < 5×10⁻⁸ curated in the GWAS Catalog (as of August 2015) at increasing odds ratios. The majority of associations demonstrate low odds ratios, while few have odds ratios that are high. Note: only associations with an available odds ratio were included in this plot (3,372 associations).

The past decade has seen GWAS’s success at uncovering part of the genetic etiology of some complex traits, which has led to notable examples of the capabilities of this tool. For the majority of associations discovered, however, the SNP explains a small proportion of the trait’s heritability – the fraction of variation in a phenotype that is due to genetic factors13. This pattern of “missing heritability” has been widely discussed13,20,21, and many have criticized GWAS as a simplistic method that does not embrace complexity13,20–22. While, for some traits, like type 1 diabetes and age-related macular degeneration, a relatively large fraction of the heritability has been explained, others have not seen such success. For Crohn’s disease, as an example, the 32 risk loci identified by 2009 together explained 20% of the trait’s heritability, and the 18 risk loci identified for type 2 diabetes explained only 6%13. While over 40 variants were identified for height, together they explained only 5% of the heritability, though the trait is estimated to have 80% heritability13,20. With disease prevention and treatment as the ultimate goal, explaining the remainder of the heritability will lead to a deeper and richer understanding of the physiological, biological, and genetic underpinnings of various common diseases.

Many theories have been widely discussed in recent years regarding where the missing heritability lies, including: rare variation13,20,22 and structural variation13,20, epigenetics23, the environment and gene-environment interactions13,18,20,21, and gene-gene interactions13,18,19,21,24.

However, as described in a pivotal paper by Moore et al. (2010), the missing heritability in GWAS associations is likely due to the simplistic nature of GWAS itself; specifically, it explores only one SNP and one phenotype at a time21. This approach fails to embrace and model the complexity that exists in biology and between phenotypes.

Three limitations of GWAS are addressed in this dissertation. First, GWAS explore traits in isolation, preventing investigators from uncovering biological and molecular relationships across multiple phenotypes. Second, GWAS are restricted to genetic tests alone, and do not account for the exposome. Finally, GWAS do not allow for interactions between genetic loci, though genetics and biology are complex. The remainder of this introduction will focus on the benefits of exploring current phenotype interrelationships, the exposome and gene-environment interactions, and gene-gene interactions.

Interrelationships Across the Phenome

Genome-wide association studies consider associations one SNP and one phenotype at a time. Yet, Hindorff et al. (2009) demonstrated evidence of phenotypic interrelationships early on when analyzing association results from the GWAS Catalog12. Eighteen loci were associated with more than one trait. Pleiotropy, when a single locus is associated with two or more phenotypes25,26, is a well-documented phenomenon. Examples of pleiotropy have been observed in mice27 and fruit flies28, as well as in humans. For example, in humans, a mutation causing truncation in the protein encoded by microphthalmia-associated transcription factor (Mitf) in Waardenburg syndrome causes a number of disparate traits: iris defects, pigmentation abnormalities, reduced mast cell production, and deafness29. As another example in humans, truncation due to mutation in a different gene, αB-crystallin (CryAB), is associated with cataracts30 and dilated cardiomyopathy31.

Current results from the GWAS Catalog (Figure 1.1 above) reveal numerous loci associated with more than one phenotype class. However, these results were compiled from many different datasets. Phenome-wide association studies (PheWAS) offer a complementary approach to GWAS and the means to uncover the etiology of complex phenotypes. This approach involves simultaneous assessment of association between genetic variants and a diverse set of phenotypes. Such analyses allow investigators to unveil existing interrelationships across multiple traits and identify genetic variants exhibiting pleiotropic effects. The PheWAS method has been utilized across numerous datasets, including the National Health and Nutrition Examination Survey (NHANES)32 (further described in Chapter 2), electronic medical record (EMR) data33–38, and the Population Architecture using Genomics and Epidemiology (PAGE) network39, as well as clinical trial data from the AIDS Clinical Trial Group (ACTG)40. PheWAS enable researchers to consider multiple phenotypes, and connections among the traits, within the same study. Large and diverse sets of phenotypes can be explored, including disease outcomes as well as intermediate, quantitative phenotypes (such as lipid or iron levels). Considering relationships between SNPs, disease outcomes, and intermediate phenotypes can help to shed light on yet undiscovered genetic etiology and on the intermediate components that lead to disease.

Extending our investigation beyond one phenotype at a time to explore pleiotropy and phenotype networks offers many benefits, including the chance to deepen our understanding of pathways and signaling cascades involved in multiple phenotypes, the underlying biology of related phenotypes, and the genetic underpinnings of comorbidities. Results such as these will likely yield precise prediction capability for multiple diseases.

Challenges facing PheWAS include an often exceedingly high multiple testing burden. Testing associations between loci across the genome (as many as 500,000 to over 1 million SNPs) and numerous phenotypes (often greater than 1,000) in one study incurs a severe multiple-testing correction. One method to reduce the number of tests is to select candidate loci expected to exhibit pleiotropy. Other methods may aid in handling the multiple testing issue without use of prior knowledge, and these include permutation testing and gene-based tests, the latter of which reduce the number of tests by combining genetic variants into units such as genes or pathways. Examples of these gene-based analysis tools include BioBin41,42 and the Sequence Kernel Association Test (SKAT)43. Another current constraint when employing PheWAS is that the genetic data often come from GWAS genotyping platforms, which are biased toward common SNPs from nuclear DNA. Methods embracing rare variants, structural variation, and non-nuclear DNA will likely provide further evidence of pleiotropy. Recently, two PheWAS were performed using copy number variant (CNV) data (Kim et al., in preparation) and mitochondrial DNA44. Further, the binning methods previously described, BioBin and SKAT, are tools to consider when investigating rare variants in PheWAS, and such efforts are underway (Pendergrass et al., in preparation).
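The scale of the multiple testing burden can be illustrated with a simple Bonferroni calculation. This is a sketch under stated assumptions: Bonferroni correction is used here only as the simplest familiar adjustment (the text does not prescribe it), the family-wise error rate of 0.05 is assumed, and the test counts are the round figures quoted above. The function name `bonferroni_threshold` is introduced for the example.

```python
def bonferroni_threshold(alpha, n_tests):
    """Per-test significance threshold that keeps the family-wise error rate at alpha."""
    return alpha / n_tests


FAMILY_WISE_ALPHA = 0.05  # assumed family-wise error rate

# A GWAS of one phenotype against about 1 million SNPs.
gwas_tests = 1_000_000
gwas_threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, gwas_tests)

# A genome-wide PheWAS: 500,000 SNPs tested against 1,000 phenotypes.
phewas_tests = 500_000 * 1_000
phewas_threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, phewas_tests)

# A candidate-locus PheWAS: 80 pre-selected SNPs against 1,000 phenotypes.
candidate_tests = 80 * 1_000
candidate_threshold = bonferroni_threshold(FAMILY_WISE_ALPHA, candidate_tests)

print(f"GWAS:              {gwas_tests:>11,} tests, threshold {gwas_threshold:.1e}")
print(f"Genome-wide PheWAS:{phewas_tests:>11,} tests, threshold {phewas_threshold:.1e}")
print(f"Candidate PheWAS:  {candidate_tests:>11,} tests, threshold {candidate_threshold:.1e}")
```

Under these assumptions, the genome-wide PheWAS threshold is 500 times stricter than the familiar 5×10⁻⁸ used for a single-phenotype GWAS, whereas restricting the analysis to a small set of candidate loci relaxes the threshold by several orders of magnitude. Because Bonferroni treats all tests as independent, it is conservative when SNPs or phenotypes are correlated, which is one motivation for the permutation-based alternatives mentioned above.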

Finally, PheWAS is a high-throughput approach that may suffer as a result. When considering large sets of associations at once, it is computationally and time intensive to make refined, phenotype-specific adjustments. Thus, researchers can either make no adjustments for potential confounders or choose a standard set of confounders for every phenotype (e.g., age, sex, and race). This can result in omitting adjustments that are needed for some phenotypes or unnecessarily over-adjusting for other traits. Additionally, non-standardization of phenotypes across study sites decreases the capacity for replication of results. The Phenotypes and Exposures (PhenX) Toolkit was developed as a resource for collecting standardized measures of phenotypes and environmental exposures45 and can help mitigate issues relating to replication.

Despite challenges facing PheWAS, this method offers a complementary approach to GWAS to explore phenotype networks and pleiotropy. Chapter 2 describes a phenome-wide association study using data from the NHANES to explore evidence of pleiotropy across multiple racial/ethnic groups. Methods like PheWAS that explore complexity beyond GWAS are essential to understanding the biological underpinnings of human phenotypes.

The Exposome and Gene-Environment Interactions

While genome-wide association studies have contributed to a greater understanding of the genetic component of complex traits, this approach does not consider the richly diverse and complex environment that is an integral component of human life. Knowledge of the impact of exposures on health will reveal individual or combinations of exposures that confer risk. Further, elucidating the nature of gene-environment interactions will help to determine the impact of alterations in biological pathways that are only predictive of disease under specific environmental conditions. Ultimately, the goal is that information about genetic and environmental complexity is used to guide environmental policies to protect vulnerable populations.

There are multiple research opportunities for identifying environmental contributions to disease, from individual environmental exposures to the “exposome”: the totality of exposures for each individual over the life course46–48. For some diseases, there are known environmental exposures that contribute to risk, such as smoking49 or UV exposure50 in age-related cataracts. Yet, for many disorders, the environmental contributions to disease risk and severity are active areas of exploration and discovery, as in efforts to understand the etiology of autism51,52. Distinguishing important environmental factors from large numbers of possible exposures is a challenge; however, environment-wide association studies (EWAS)53 offer a high-throughput solution.

Prior to the development of EWAS, researchers were restricted to assessing one or a small number of environmental exposures at a time, and these were typically chosen because they had some demonstrated involvement with the chosen trait. Similar to GWAS, EWAS enables discovery of novel environment-phenotype associations, as it does not rely on what is known in environmental epidemiology. The relatively recent advent of EWAS has allowed researchers to scan all available environmental measures agnostically and identify multiple, previously undiscovered exposures predicting disease. The first EWAS was performed by Patel et al., using 266 exposures from NHANES for type 2 diabetes (T2D). These researchers found associations for polychlorinated biphenyls (PCBs) and the vitamin γ-tocopherol as risk exposures for T2D and β-carotene as a protective factor53. Since this study, several other EWAS have been performed, including for metabolic syndrome54, serum lipid levels55, all-cause mortality56, and preterm birth57, as well as another EWAS for T2D using phenotype data from an electronic medical record58 (further described in Chapter 3). Using the EWAS approach, targeted studies have additionally looked at diet alone for blood pressure59 and metal exposure60.

Environment-wide association studies additionally offer an approach for filtering exposures for subsequent GxE interactions. Pairwise tests for interactions between genome-wide SNPs and hundreds to thousands of exposures increase the multiple testing burden such that detecting true signal is challenging. EWAS can be utilized to identify putative exposures for interactions with all, or filtered, genetic variants, thereby greatly reducing the number of tests. This approach has successfully been implemented in a recent study by Patel et al. (2013) for T2D55. Using results from the EWAS previously described for T2D, these investigators selected five exposures to test for interactions with known risk loci in NHANES. The authors identified a statistically significant interaction between the exposure, trans-β-carotene, and rs13266634 (solute carrier family 30, member 8 (SLC30A8)), and, while this SNP is a known risk locus for T2D, this novel discovery provides important insights into the biological mechanisms underlying this complex trait.
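The reduction in testing burden that EWAS-based filtering provides can be illustrated with simple arithmetic. The sketch below is illustrative only: the 266 exposures and the five selected exposures come from the studies described above, but the 500,000-SNP panel and the count of 20 known risk loci are assumed values chosen for the example, not figures reported by Patel et al.

```python
def gxe_test_count(n_variants: int, n_exposures: int) -> int:
    """Number of pairwise variant-by-exposure interaction tests."""
    return n_variants * n_exposures


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Exhaustive scan: an assumed 500,000 SNPs against 266 exposures.
exhaustive = gxe_test_count(500_000, 266)

# Filtered scan: five EWAS-selected exposures against an assumed 20 risk loci.
filtered = gxe_test_count(20, 5)

for name, n_tests in (("exhaustive", exhaustive), ("filtered", filtered)):
    threshold = bonferroni_threshold(0.05, n_tests)
    print(f"{name}: {n_tests:,} tests, threshold {threshold:.1e}")
```

Under these assumed numbers, filtering moves the analysis from over one hundred million tests to one hundred, which relaxes the Bonferroni threshold by several orders of magnitude.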

Other examples of gene-environment interactions have been observed in humans. Depression, for instance, was found by Caspi et al. (2003) to be two times more likely in individuals who are heterozygous or homozygous for the short allele of the serotonin transporter (5-HTT) promoter after stressful life events (involving threat, loss, humiliation, or defeat) than in those who are homozygous for the long allele61. As another example, a polymorphism in N-acetyltransferase 2 (NAT2) was found to interact with smoking or occupational exposures in association with bladder cancer in the International Project on Genetic Susceptibility to Environmental Carcinogens62. NAT2 encodes an enzyme involved in activation and deactivation of arylamines, which have a well-established association with bladder cancer and are found in occupational exposures and cigarettes63. It has been suggested that the NAT2 polymorphism may modulate the effect of occupational exposures and smoking, specifically arylamines62.

A major limitation to successfully exploring the exposome and its impact on health and disease is the availability of standardized measures to robustly collect exposure data. The aforementioned NHANES provides a rich resource for exploring blood levels of exposures. However, these types of studies are expensive and are therefore few and far between. Survey tools offer an accessible and relatively inexpensive alternative. The previously mentioned PhenX Toolkit is a standardized questionnaire that allows for replication of EWAS and GxE interaction results across multiple studies. Still, even with the increased availability of data like NHANES and PhenX, the exposome is largely elusive. Excluding notable exceptions like somatic mosaicism, the genome is largely static throughout life. Fortunately, large efforts have made sequencing of the human genome possible. Conversely, the exposome is dynamic and ever changing throughout the life course. Additionally, what is considered an “exposure” and “outcome”/“phenotype” can vary depending on the context. For example, blood glucose levels may rise as a result of diet, in which case blood glucose would be considered an outcome. Yet, there may be downstream effects of rises in blood glucose levels (such as cardiovascular disease risk), and in this case blood glucose would be considered a predictor (or even an exposure) for those outcomes. It is drastically simpler to think of exposures as external “assaults” on the body; things get far more complex when we include exposures that are inside the body. Even if the original upstream cause of the internal exposure was from external influences, disruptions in pathways and biochemical processes can have effects of their own. Needless to say, there are complexities at play when considering the exposome that are not applicable in the genome. It is likely that the next decade will see a major increase in the availability of exposome data and that more and more researchers will be jumping aboard the exposome wagon. Of course, with availability of data comes the next, perhaps greater, challenge: analyzing the data in ways that are medically meaningful.

Clearly, there are many challenges ahead as the availability of complex exposure data increases. Currently, though, there are multiple research opportunities for identifying important environmental contributions to disease using the data that are available. EWAS provides a powerful, agnostic tool to assess a large set of exposures simultaneously. Chapter 3 describes an EWAS and subsequent gene-environment interaction analysis for type 2 diabetes using data from the Marshfield Personalized Medicine Research Project. Additional opportunities for novel method development and strategies include incorporating environmental exposure data into phenome-wide association studies (PheWAS) to elucidate pleiotropy in the context of gene-environment interactions (as further discussed in Chapter 6). Ultimately, including the environment in complex modeling is likely to greatly increase the ability to understand and predict complex human disease.

Genetic Interactions

We have seen how embracing phenotypic and environmental complexity has the potential for greater disease prediction ability. In this final section of the chapter, we turn our attention to the need to incorporate genetic interactions in our search for the complex mechanisms underlying common traits. GWAS focus on the effect of each locus individually on the trait of interest. Gene products, however, do not function in isolation in biology; rather, they interact with one another physically or within one or multiple pathways. Further, regulated protein-protein and protein-DNA/RNA interactions are essential components of transcription64 and translation65. GWAS do not allow for the complexity known to exist in molecular and cell biology and physiology. Embracing complexity by exploring gene-gene interactions, or epistasis, is likely to explain much of the heritability missing in knowledge gained from GWAS alone, as traits arise from the vastly dynamic and interconnected systems in biology.

The word “epistasis” was coined by Bateson (1909) to describe the effect of one gene being masked by another66. Since then, the definition has been refined, and a distinction has been made between two aspects of epistasis: statistical and biological. Statistical epistasis refers to deviation from additivity, while biological epistasis involves the effect of one locus being modulated by another. As a hypothetical example of biological epistasis, consider GeneA, which encodes a transcription factor that binds PromoterB. Alterations in either GeneA or PromoterB can prevent transcription factor binding and ultimately modify downstream expression of the gene(s) regulated by this process.

For an example of statistical epistasis, let us imagine two loci: LocusA and LocusB. If the relationship between the two loci is additive, we would expect the combined effect of the two on a phenotype to be the sum of the main effect of LocusA and the main effect of LocusB. For example, if there is a 2-fold and 3-fold risk associated with the risk alleles for LocusA and LocusB, respectively, the additive result of the risk alleles from both loci is a 5-fold increase in risk. If the relationship is epistatic, however, the effect of the two loci together will not equal the added main effects of the two loci. In the scenario described, the presence of both risk alleles under an epistatic relationship could be a 15-fold risk increase; conversely, risk could decrease to 1.1-fold. In other words, when statistical epistasis occurs, there is a non-linear relationship between the effects of two or more loci when combined.
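The arithmetic of this hypothetical example can be written out as a brief sketch. It follows the text's convention of summing the fold-risks of the two loci to obtain the additive expectation; the function names are illustrative and do not correspond to any particular software package.

```python
def additive_expectation(effect_a: float, effect_b: float) -> float:
    """Joint effect expected when two loci combine additively."""
    return effect_a + effect_b


def epistasis_deviation(observed_joint: float, effect_a: float, effect_b: float) -> float:
    """Departure of the observed joint effect from the additive expectation."""
    return observed_joint - additive_expectation(effect_a, effect_b)


# Main effects from the example: 2-fold (LocusA) and 3-fold (LocusB).
locus_a, locus_b = 2.0, 3.0

# Three hypothetical joint effects: additive, and two epistatic scenarios.
for observed in (5.0, 15.0, 1.1):
    deviation = epistasis_deviation(observed, locus_a, locus_b)
    label = "additive" if abs(deviation) < 1e-9 else "epistatic"
    print(f"observed {observed}-fold: deviation {deviation:+.1f} ({label})")
```

A deviation of zero corresponds to the additive case, while the 15-fold and 1.1-fold scenarios deviate in opposite directions from the 5-fold expectation.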

As described by Moore (2003), epistasis is thought to be ubiquitous in biology67,68. Examples of epistasis have been observed in many model organisms69, including yeast70–75, C. elegans76–78, D. melanogaster79–81, mice82–84, and A. thaliana85,86. Evidence of epistasis has also been observed in humans. For instance, interactions were identified by Nelson et al. (2001) between ApoB and ApoE in females as well as between the low-density lipoprotein receptor and the ApoAI/CIII/AIV complex in males for triglyceride levels87. Ritchie et al. identified interactions between SNPs in three estrogen metabolism genes, COMT, CYP1B1, and CYP1A1, that were predictive of sporadic breast cancer88. Further, a recent study by Hemani et al. (2014) identified and replicated a large number of genetic interactions involved in gene expression regulation89.

Despite its ubiquity, epistasis has proven challenging to uncover using traditional association techniques. First, computational and time requirements are extensive, with a high penalty for correction of multiple comparisons, when exhaustively testing pairwise combinations of all genome-wide SNPs. For instance, comprehensive combinations between 500,000 SNPs will yield on the order of 1×10¹¹ pairwise SNP-SNP tests. Under a Bonferroni correction criterion with an alpha level of 0.05, a model would need to have a p-value below approximately 5×10⁻¹³ in order to be considered statistically significant. This is a stringent threshold, and one that will likely result in missing many true positive interaction models. One method for overcoming this challenge is to filter the number of loci investigated. Two main strategies for filtering include: (1) main effect filtering, which limits the interaction analysis to only include one or more variants that have demonstrated an association with the trait through GWAS or candidate gene studies; and (2) knowledge-based filtering, which filters SNPs based on biologically-established gene-gene interactions90. Main effect filtering is a logistically simple method, as it typically requires testing a set of SNPs for association with a trait of interest and choosing those exhibiting a main effect. Nevertheless, as described by Ritchie (2011), the major drawback to this design is that the chosen SNPs are limited to those that have a demonstrated effect on the phenotype on their own, which excludes those SNPs that only exhibit an effect on the phenotype when in combination with other loci91. Knowledge-based filtering was presented as a creative alternative to main effect filtering24,91–94. These approaches decrease the investigation search space to biologically related gene pairs. Previous studies have shown success in applying prior knowledge to genetic interaction analyses for ovarian cancer95, clinical cancer outcome96, multiple sclerosis97,98, HIV pharmacogenetics99, age-related cataract100,101 (further described in Chapter 4), HDL cholesterol92,102, and other lipid traits (Holzinger et al., submitted; De et al., submitted). While the resulting models from this approach are limited by the biological knowledge available, the combinations are chosen because they are related in some manner (e.g., the same pathway, gene family, or function). Therefore, SNPs that do not exhibit a main effect will not automatically be excluded from the interaction analysis and are, if biologically relevant, paired with SNPs in genes that are known to be related.
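The multiple testing figures quoted in this paragraph can be checked with a few lines of arithmetic. The sketch below counts unordered SNP pairs exactly, so it gives roughly 1.25×10¹¹ tests and a threshold near 4×10⁻¹³, the same order of magnitude as the rounded values in the text. The filtered panel of 1,000 SNPs is an assumed number used only to show how filtering relaxes the threshold.

```python
from math import comb


def pairwise_tests(n_snps: int) -> int:
    """Number of unordered SNP-SNP pairs among n_snps loci."""
    return comb(n_snps, 2)


def bonferroni_threshold(alpha: float, n_tests: int) -> float:
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


# Exhaustive pairwise analysis of 500,000 genome-wide SNPs.
exhaustive_tests = pairwise_tests(500_000)
exhaustive_threshold = bonferroni_threshold(0.05, exhaustive_tests)

# After filtering to an assumed 1,000 SNPs (hypothetical panel size).
filtered_tests = pairwise_tests(1_000)
filtered_threshold = bonferroni_threshold(0.05, filtered_tests)

print(f"exhaustive: {exhaustive_tests:.3e} tests, threshold {exhaustive_threshold:.1e}")
print(f"filtered:   {filtered_tests:.3e} tests, threshold {filtered_threshold:.1e}")
```

Whichever filtering strategy is used, the gain comes from the quadratic growth of the pair count: shrinking the SNP set by a factor of 500 shrinks the number of tests by a factor of roughly 250,000.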

Another major challenge to identifying SNP models with epistatic action is the manner in which SNPs are genetically encoded. Traditional encoding methods include additive, dominant, and recessive, and each makes important assumptions about a given SNP’s biological action. It is common for investigators to choose one encoding (most often additive). However, not every SNP across the genome demonstrates the same genetic action. If a SNP is encoded under a different assumption than its true genetic action, this can lead to a reduction in power or inflation of false positive results. The traditional approach is inflexible and does not reflect the diversity known to exist in genetics.
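The three traditional encodings can be sketched as a mapping from the number of copies of the coded allele (0, 1, or 2) to a numeric predictor. This is a minimal illustration of the standard conventions, assuming genotypes are counted relative to a single coded allele; the function name is illustrative.

```python
def encode_genotype(allele_count: int, model: str) -> int:
    """Encode a genotype (copies of the coded allele) under a genetic model."""
    if allele_count not in (0, 1, 2):
        raise ValueError("allele_count must be 0, 1, or 2")
    if model == "additive":
        # Each copy of the allele contributes equally.
        return allele_count
    if model == "dominant":
        # One copy is enough to show the full effect.
        return int(allele_count >= 1)
    if model == "recessive":
        # Two copies are required to show the effect.
        return int(allele_count == 2)
    raise ValueError(f"unknown model: {model}")


for model in ("additive", "dominant", "recessive"):
    codes = [encode_genotype(count, model) for count in (0, 1, 2)]
    print(f"{model:>9}: {codes}")
```

The heterozygote is where the models disagree: it is coded halfway between the homozygotes under the additive model, grouped with the two-copy homozygote under the dominant model, and grouped with the zero-copy homozygote under the recessive model. Choosing the wrong grouping is what leads to the loss of power described above.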

Genome-wide association studies do not account for the complexity of biological interactions and networks known to occur in lower-order and animal models as well as in humans. They are also not equipped to identify putative loci for interaction if said loci have no demonstrated main effect. Alternative methods beyond GWAS are necessary for identifying complexity. Chapter four describes a study using biological knowledge to build informed gene-gene interaction models for epistatic discovery in age-related cataract. Chapter five presents a novel method to calculate genetic action, rather than assign it, for subsequent epistatic application, which has demonstrated robust power and low type 1 error compared to traditional encodings. Methods like those described in chapters four and five are essential for 1) understanding the complex molecular, cellular, and physiological systems underlying common traits, 2) further explaining the heritability missing in the results from GWAS, and 3) identifying combinations of multiple risk genotypes important for disease prevention and treatment.

Impact

Knowledge of the nature of pleiotropy, the environment, gene-environment interactions, and genetic interactions will help to identify genes that are only found to be associated with disease under certain genetic and environmental contexts. The purpose of the methods geared toward complexity is not to replace GWAS, which have proved a valuable resource in identifying disease risk loci. Rather, the goal is to build upon GWAS and embrace molecular, phenotypic, and environmental complexity. Ultimately, further investigation with sophisticated methods will reveal numerous gene-gene and gene-environment interactions predictive of common, complex disease.

The following dissertation addresses three previously described limitations to GWAS. Methods described have the potential to: uncover genetic interactions, unveil complex biology behind phenotype networks, inform public policy decisions concerning environmental exposures, and ultimately assess individual disease risk.

CHAPTER 2

UNVEILING INTERRELATIONSHIPS ACROSS PHENOTYPES USING

PHENOME-WIDE ASSOCIATION STUDIES (PHEWAS)

*This chapter is adapted from Hall, M. A. et al. Detection of pleiotropy through a phenome-wide association study (PheWAS) of epidemiologic data as part of the Environmental Architecture for Genes Linked to Environment (EAGLE) Study. PLoS Genet. 10, (2014)

Abstract

We performed a phenome-wide association study (PheWAS) utilizing diverse genotypic and phenotypic data existing across multiple populations in the National Health and Nutrition Examination Surveys (NHANES), conducted by the Centers for Disease Control and Prevention (CDC), and accessed by the Epidemiological Architecture for Genes Linked to Environment (EAGLE) study. We calculated comprehensive tests of association in Genetic NHANES using 80 SNPs and 1,008 phenotypes (grouped into 184 phenotype classes), stratified by race-ethnicity. Genetic NHANES includes three surveys (NHANES III, 1999-2000, and 2001-2002) and three race-ethnicities: non-Hispanic whites (n=6,634), non-Hispanic blacks (n=3,458), and Mexican Americans (n=3,950). We identified 69 PheWAS associations replicating across surveys for the same SNP, phenotype-class, direction of effect, and race-ethnicity at p < 0.01, allele frequency > 0.01, and sample size > 200. Of these 69 PheWAS associations, 39 replicated previously reported SNP-phenotype associations, 9 were related to previously reported associations, and 21 were novel associations. Fourteen results had the same direction of effect across more than one race-ethnicity: one result was novel, 11 replicated previously reported associations, and two were related to previously reported results. Thirteen SNPs showed evidence of pleiotropy. We further explored results with gene-based biological networks, contrasting the direction of effect for pleiotropic associations across phenotypes. One PheWAS result was ABCG2 missense SNP rs2231142, associated with uric acid levels in both non-Hispanic whites and Mexican Americans, protoporphyrin levels in non-Hispanic whites and Mexican Americans, and blood pressure levels in Mexican Americans. Another example was SNP rs1800588 near LIPC, significantly associated with the novel phenotypes of folate levels (Mexican Americans), vitamin E levels (non-Hispanic whites) and triglyceride levels (non-Hispanic whites), and replication for cholesterol levels. The results of this PheWAS show the utility of this approach for exposing more of the complex genetic architecture underlying multiple traits, through generating novel hypotheses for future research.

Introduction

Genome-wide association studies (GWAS) have led to the discovery of thousands of variants associated with disease and phenotypic outcomes12. These analyses focus on investigating the association between hundreds of thousands to over a million single nucleotide polymorphisms (SNPs) and a single, or small set, of phenotypes and/or disease outcomes. While a wealth of information about the relationship between SNPs and phenotypes has been revealed, an extensive picture of the complex genetic architecture underlying common disease has yet to be elucidated. In addition, the relationship between SNPs and multiple phenotypes (pleiotropy) is only beginning to be explored.

Phenome-wide association studies (PheWAS) are a complementary approach to GWAS for investigating the complex networks that exist between human phenotypes and genetic variation, through testing a series of SNPs for association with a large and diverse set of phenotypes33,35,39,103. These analyses can be used to investigate the relationship between genetic variants and presence/absence of disease and phenotypic outcomes as well as the association between genetic variation and intermediate clinically measured variables such as cholesterol levels, blood pressure measurements, and total iron binding capacity. PheWAS can be used to replicate relationships found in GWAS as well as to discover novel associations and generate hypotheses for further research. This approach also allows for the detection of SNPs with pleiotropic effects, where one genetic variant is associated with multiple phenotypes25,26. Investigating the interrelationships that exist between phenotypes as well as between genetic variation and phenotypic variation has the potential for uncovering the complex mechanisms underlying common human phenotypes.

This chapter describes a PheWAS using epidemiologic data from the National Health and Nutrition Examination Surveys (NHANES) collected by the Centers for Disease Control and Prevention and accessed by the Epidemiological Architecture for Genes Linked to Environment (EAGLE) study as part of the Population Architecture using Genomics and Epidemiology (PAGE) network104. A major focus of the PAGE network is the replication and generalization of GWAS-identified variants in diverse populations, as the majority of published GWAS have been performed in populations of European descent with little generalization across other racial/ethnic groups. Thus, the PAGE network has pursued investigating associations for genetic variants that have been well replicated in previous research across ancestry groups beyond European descent.

As a part of PAGE, EAGLE genotyped 80 GWAS-identified variants in two NHANES datasets representing three surveys: NHANES III, collected between 1991 and 1994, and Continuous NHANES, which was collected in 1999-2000 and 2001-2002, across three race-ethnicities. The majority of the SNPs within our study were chosen for genotyping based on published lipid trait genetic association studies (51 SNPs), but our study also included SNPs previously associated with phenotypes such as C-reactive protein levels, coronary heart disease, and age-related macular degeneration, with detailed information about these SNPs in Appendix Table 2.1. Genotyping was performed in a total of 14,998 NHANES participants with DNA samples including 6,634 self-reported non-Hispanic whites, 3,458 self-reported non-Hispanic blacks, and 3,950 self-reported Mexican Americans. Similar to the PheWAS framework outlined by the PAGE study39, we performed comprehensive unadjusted tests of association for 80 SNPs with 1,008 phenotypes, using linear or logistic regression, depending on the phenotype, stratified by race-ethnicity.

With this approach we replicated many previously reported associations and identified novel genotype-phenotype relationships. We have performed our analyses across multiple genetic ancestries. Most importantly, we have also found indications of pleiotropy for a number of the SNPs included in our investigation. When we further investigated the association results for SNPs relating to multiple phenotypes, we identified some contrasting directions of effect. We further explored the relationship between SNPs, genes, and known biological relationships between the genes, identifying network relationships within these results. The findings in this chapter demonstrate that PheWAS is a useful method for both validating findings from GWAS and discovering previously unknown genotype-phenotype relationships in diverse populations, enriching our understanding of the complex relationships between human phenotypes.

Materials and Methods

Study Design and Populations

Two NHANES surveys [1994] were included in this study. The epidemiological survey data and DNA samples of NHANES III were collected between 1991 and 1994, and Continuous NHANES was collected in 1999-2000 and 2001-2002. For some of the phenotypes, harmonization across NHANES III and Continuous NHANES was possible. Thus, for a subset of phenotypes, we were able to use the two surveys combined, which we refer to as NHANES Combined. NHANES measures the health and nutritional habits of U.S. participants regardless of health status across race-ethnicity, by collecting medical, dietary, demographic, laboratory, lifestyle, and environmental exposure data via questionnaire, direct laboratory measures, and a physical exam. In NHANES, specific age groups (such as the young or elderly) and racial/ethnic groups are oversampled. The epidemiological data of NHANES and the associated DNA samples were collected by the National Center for Health Statistics (NCHS) at the Centers for Disease Control and Prevention (CDC). All procedures were approved by the CDC Ethics Review Board and written informed consent was obtained from all participants. Because no identifying information is available to the investigators, Vanderbilt University’s Institutional Review Board determined that this study met the criteria of “non-human subjects.”

Genotyping and SNP Selection

For this study, EAGLE genotyped 80 GWAS-identified variants in two NHANES datasets representing three surveys: NHANES III, collected between 1991 and 1994, and Continuous NHANES, collected in 1999-2000 and 2001-2002. The majority of the SNPs within our study were chosen for genotyping based on published lipid trait genetic association studies. Also included in this study are SNPs previously associated with a range of other phenotypes, and we detail information about these SNPs in Appendix Table 2.1, including the genotyping method for each SNP (unless the SNP was already available within NHANES before EAGLE genotyping, in which case we cite the lab that provided the genotypic data to NHANES). Genotyping was performed in a total of 14,998 NHANES participants with DNA samples including 6,634 self-reported non-Hispanic whites, 3,458 self-reported non-Hispanic blacks, and 3,950 self-reported Mexican Americans. Genotypes included in this study were accessed from (1) genotyping performed using Sequenom by the Vanderbilt DNA Resources Core, or (2) existing data in the Genetic NHANES database. In addition to genotyping experimental NHANES samples, blinded duplicates provided by CDC and HapMap controls (n=360) as part of the PAGE study were also genotyped. Quality control, which included concordance and Hardy-Weinberg equilibrium, was performed on all SNPs by the CDC. All SNPs that passed quality control are available for secondary analyses through NCHS/CDC.

Statistical Methods

Single SNP unadjusted tests of association were performed for 80 SNPs available in NHANES III and Continuous NHANES and 1,008 phenotypes. When the exact phenotype was measured in NHANES III and Continuous NHANES, the unadjusted tests of association were also performed for all samples as part of Combined NHANES. Tests of association between all SNPs and phenotypes were performed using linear or logistic regression, depending on whether the phenotype was continuous or binary. For categorical phenotypes, binning was used to create new variables of the form “A versus not A” for each category, and logistic regression was used to model the new binary variables. All continuous phenotypes were natural log transformed using a y to log(y+1) transformation of the response variable; adding 1 to every continuous measurement before transformation prevented values recorded as zero from being omitted from analysis. All analyses were stratified by self-reported race-ethnicity. Analyses were performed remotely in SAS v9.2 (SAS Institute, Cary, NC) using the Analytic Data Research by Email (ANDRE) portal of the CDC Research Data Center in Hyattsville, MD.
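The phenotype preparation steps described in this section can be sketched in a few lines. The analyses themselves were run in SAS; the Python below is only an illustration of the same transformations under simplifying assumptions (no missing values, categories given as strings), and the function names are hypothetical.

```python
import math


def log_transform(y: float) -> float:
    """Natural log of (y + 1), so a measurement of zero maps to zero."""
    return math.log(y + 1)


def one_versus_rest(values: list) -> dict:
    """Bin a categorical phenotype into 'A versus not A' binary variables."""
    categories = sorted(set(values))
    return {
        category: [1 if value == category else 0 for value in values]
        for category in categories
    }


def regression_type(phenotype_kind: str) -> str:
    """Choose the regression used for a phenotype of the given kind."""
    if phenotype_kind == "continuous":
        return "linear"
    if phenotype_kind in ("binary", "categorical"):
        # Categorical phenotypes are modeled through their binary bins.
        return "logistic"
    raise ValueError(f"unknown phenotype kind: {phenotype_kind}")


# Hypothetical continuous measurements, including a recorded zero.
print([round(log_transform(y), 3) for y in (0.0, 1.5, 20.0)])

# Hypothetical three-level categorical phenotype.
print(one_versus_rest(["never", "former", "current", "never"]))
```

A plain log transformation is undefined at zero, which is why the +1 offset keeps zero-valued measurements in the analysis.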

NHANES phenotypes

A wide range of phenotypic variables was available for both NHANES III and Continuous NHANES. We used only phenotypes for this study that could be binned into phenotype classes across more than one NHANES (see phenotype classes section for more details), so that we could seek replication for association results across surveys. Detailed information on the collection of each of the phenotypes is available through the CDC, for NHANES III (http://www.cdc.gov/nchs/nhanes/nh3data.htm) and for Continuous NHANES (http://wwwn.cdc.gov/nchs/nhanes/search/nhanes_continuous.aspx).

Phenotype Classes

To facilitate comparisons across NHANES, similar phenotypes from each of the NHANES were binned into 184 “phenotype-classes” via manual inspection by one person and review by a second individual, similar to previously described phenotype binning103. The development of phenotype-classes was necessary for several reasons. First, not all phenotypes and exposures were surveyed or collected in the same way for each iteration of NHANES, and thus could not be completely harmonized. However, some of these phenotypes were similar enough across surveys to be binned into the same phenotype-class (for example, “Arm Circumference” and “Upper Arm Length” were both binned in the “Body Measurements (Arm)” phenotype-class). Second, when matching phenotypes and exposures, the labels across and within NHANES varied even for the same phenotypes. For example, “Vitamin A” and “Serum Vitamin A” both measured the same phenotype and thus were both classified in the “Vitamin A” phenotype-class. For the majority of PheWAS results, there were multiple significant NHANES measures for each phenotype class, and we reported the lowest p-value in descriptions of the PheWAS results within the figures and the results.
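The binning and reporting rule described above can be sketched as a lookup from NHANES labels to phenotype-classes, followed by selection of the lowest p-value within each class. The four labels and two classes are the examples given in the text; the p-values are invented for illustration, and the actual binning was done by manual inspection rather than by code.

```python
# Label-to-class lookup using only the examples from the text.
PHENOTYPE_CLASS = {
    "Arm Circumference": "Body Measurements (Arm)",
    "Upper Arm Length": "Body Measurements (Arm)",
    "Vitamin A": "Vitamin A",
    "Serum Vitamin A": "Vitamin A",
}


def lowest_p_per_class(results):
    """Map each phenotype-class to the (label, p-value) with the smallest p."""
    best = {}
    for label, p_value in results:
        phenotype_class = PHENOTYPE_CLASS[label]
        if phenotype_class not in best or p_value < best[phenotype_class][1]:
            best[phenotype_class] = (label, p_value)
    return best


# Hypothetical association results for a single SNP (illustrative p-values).
example_results = [
    ("Arm Circumference", 0.004),
    ("Upper Arm Length", 0.0007),
    ("Vitamin A", 0.009),
    ("Serum Vitamin A", 0.002),
]

for phenotype_class, (label, p_value) in lowest_p_per_class(example_results).items():
    print(f"{phenotype_class}: {label} (p = {p_value})")
```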

Threshold for Significance

A significant PheWAS result met all of the following criteria: 1) a SNP-phenotype association was observed in both NHANES III and Continuous NHANES, 2) with p-value < 0.01, 3) allele frequency > 0.01, 4) sample size > 200, 5) for the same race-ethnicity, 6) phenotype class, and 7) direction of effect. For each of these consistent associations, we examined tests of association results for Combined NHANES. Significant PheWAS results were then plotted using Phenogram105 and PheWAS-View software specifically developed for visualization of PheWAS results (http://ritchielab.psu.edu/ritchielab/software/).
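The seven criteria above amount to a per-survey filter plus a cross-survey consistency check, which can be sketched as follows. The record layout, the use of the sign of the regression coefficient as the direction of effect, and all example values are assumptions made for illustration; they are not taken from the study's analysis code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Association:
    """One SNP-phenotype test result from a single survey (hypothetical layout)."""
    snp: str
    phenotype_class: str
    race_ethnicity: str
    p_value: float
    allele_frequency: float
    sample_size: int
    beta: float  # regression coefficient; its sign gives the direction of effect


def passes_thresholds(result: Association) -> bool:
    """Criteria 2-4: p-value, allele frequency, and sample size cutoffs."""
    return (
        result.p_value < 0.01
        and result.allele_frequency > 0.01
        and result.sample_size > 200
    )


def is_significant(nhanes_iii: Association, continuous: Association) -> bool:
    """Criteria 1 and 5-7: replication across surveys for matching strata."""
    return (
        passes_thresholds(nhanes_iii)
        and passes_thresholds(continuous)
        and nhanes_iii.snp == continuous.snp
        and nhanes_iii.race_ethnicity == continuous.race_ethnicity
        and nhanes_iii.phenotype_class == continuous.phenotype_class
        and (nhanes_iii.beta > 0) == (continuous.beta > 0)
    )


# Hypothetical results for a placeholder SNP in the two surveys.
first = Association("rs_example", "Vitamin A", "Mexican American", 0.004, 0.25, 1500, 0.12)
second = Association("rs_example", "Vitamin A", "Mexican American", 0.008, 0.27, 1200, 0.09)
print(is_significant(first, second))
```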

Biofilter

Biofilter106,107 is a software package that allows the user to download and automatically integrate several different knowledge databases into a single accessible database called the Library of Knowledge Integration, and then run queries via Biofilter with the resultant integrated data (https://ritchielab.psu.edu/ritchielab/software/). We used Biofilter to annotate the SNPs of this study with the location and identification of the nearest genes to each of our SNPs, from NCBI dbSNP and NCBI Gene (http://www.ncbi.nlm.nih.gov/). We also applied information from the Kyoto Encyclopedia of Genes and Genomes (KEGG)108, Gene Ontology (GO)109, and NetPath110. This allowed us to highlight known connections between genes. Thus, we were able to identify any biological pathway or grouping connections between the genes that the SNPs in our study were in or near.

Cytoscape

After we used Biofilter to annotate the genes as described above, we stratified the results by race-ethnicity. We used Cytoscape111 to visualize the connections between genes based on their annotation. Using this visualization tool, we explored networks where one or more SNPs were connected, via genes, to shared pathways or genes, and we did not further investigate any resultant networks composed of single SNPs.
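The network step above, in which SNPs are linked through shared gene or pathway annotations and single-SNP networks are set aside, can be sketched with a small union-find routine. This is an illustration only and does not use Cytoscape or Biofilter; the SNP and pathway identifiers are hypothetical placeholders.

```python
from collections import defaultdict


def snp_networks(snp_to_annotations):
    """Group SNPs that share at least one annotation (gene, pathway, or group).

    Returns the connected groups that contain more than one SNP.
    """
    parent = {snp: snp for snp in snp_to_annotations}

    def find(x):
        # Follow parent links to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a, b):
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    # Invert the mapping: annotation -> SNPs carrying it.
    annotation_to_snps = defaultdict(list)
    for snp, annotations in snp_to_annotations.items():
        for annotation in annotations:
            annotation_to_snps[annotation].append(snp)

    # SNPs that share an annotation belong to the same network.
    for snps in annotation_to_snps.values():
        for other in snps[1:]:
            union(snps[0], other)

    components = defaultdict(set)
    for snp in parent:
        components[find(snp)].add(snp)
    # Networks made up of a single SNP are not carried forward.
    return [members for members in components.values() if len(members) > 1]


# Hypothetical annotations for illustration only.
annotations = {
    "rs_a": {"pathway_1"},
    "rs_b": {"pathway_1", "pathway_2"},
    "rs_c": {"pathway_2"},
    "rs_d": {"pathway_9"},
}
networks = snp_networks(annotations)
```

In this toy input, rs_a, rs_b, and rs_c form one network through two overlapping pathways, while rs_d is dropped because nothing else shares its annotation.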

RegulomeDB

RegulomeDB was used to annotate PheWAS-significant SNPs in this study with functional and regulatory information for our analyses.

Results

The study population characteristics for the epidemiologic surveys accessed by EAGLE for this PheWAS are given in Table 2.1. Across the data collected for NHANES, there were 14,998 participants with DNA samples. More than half of the participants were female (54.12%), and the median age was 43. While ~44% of the samples were from participants self-described as non-Hispanic white (n=6,634), nearly half of the samples were from participants self-described as either non-Hispanic black (n=3,458) or Mexican American (n=3,950). As expected, based on ascertainment and changes in consenting for genetic studies112, NHANES III had more female and non-European participants with DNA samples compared with Continuous NHANES.

Table 2.1. Study population characteristics.

As detailed in the PheWAS workflow diagram shown in Figure 2.1, we first identified 184 phenotype classes across NHANES from a total of 1,008 unique variables available for analysis in NHANES III and Continuous NHANES, respectively (Table 2.2). We then performed unadjusted single SNP tests of association assuming an additive genetic model for each SNP and phenotype class in NHANES III and Continuous NHANES. Our criteria for a significant PheWAS result were a SNP-phenotype association observed in both NHANES III and Continuous NHANES with p-value < 0.01, for SNPs with an allele frequency > 0.01 and a sample size > 200, for the same race-ethnicity, phenotype-class, and direction of effect. We identified 69 PheWAS results meeting this significance threshold. Of these 69 PheWAS results, 39 replicated previously reported SNP-phenotype associations from the literature. Of the remaining results, 9 were related to previously reported associations in the literature, and 21 were novel SNP-phenotype associations. Moreover, 13 SNPs showed evidence of pleiotropy, where a particular SNP was associated with more than one phenotype. For the majority of results meeting our PheWAS criteria for replication, each SNP had multiple associations for each phenotype class; thus, in the text we report only the most statistically significant result.
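An unadjusted single-SNP test under an additive genetic model can be sketched as a simple linear regression of the phenotype on the number of coded alleles (0, 1, or 2). The sketch below is an assumption-laden illustration, not the software used in the study: it covers only continuous phenotypes, the function name is hypothetical, the toy genotypes and phenotypes are invented, and it returns the slope and its standard error without a p-value. The optional ln(x + 1) transform corresponds to the "(ln + 1)" rows in the result tables.

```python
import math


def additive_model_fit(genotypes, phenotypes, log_transform=False):
    """Ordinary least squares of phenotype on coded-allele count (0/1/2), no covariates.

    Returns (beta, standard error, sample size).
    """
    y = [math.log(v + 1.0) for v in phenotypes] if log_transform else list(phenotypes)
    n = len(y)
    mean_g = sum(genotypes) / n
    mean_y = sum(y) / n
    sxx = sum((g - mean_g) ** 2 for g in genotypes)
    sxy = sum((g - mean_g) * (v - mean_y) for g, v in zip(genotypes, y))
    beta = sxy / sxx
    intercept = mean_y - beta * mean_g
    rss = sum((v - intercept - beta * g) ** 2 for g, v in zip(genotypes, y))
    se = math.sqrt(rss / (n - 2) / sxx)
    return beta, se, n


# Invented toy data: each extra coded allele raises the phenotype by about 3 units.
toy_genotypes = [0, 0, 1, 1, 2, 2]
toy_phenotypes = [10.0, 12.0, 13.0, 15.0, 16.0, 18.0]
beta, se, n = additive_model_fit(toy_genotypes, toy_phenotypes)
```

A p-value would follow from comparing beta / se against a t distribution with n - 2 degrees of freedom, which is left out here to keep the example within the standard library.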


Figure 2.1. Overview of the approach for this study. Genotypic and phenotypic data were collected in NHANES III and Continuous NHANES. The phenotypes for the two studies were matched into phenotype classes. Comprehensive associations were calculated for the genotypes and phenotypes for each survey independently. The results that were found in both surveys, with p<0.01, for the same phenotype-class, and race-ethnicity, and same direction of effect, were maintained for further inspection in this study.

Table 2.2. Phenotype-classes.

Replication of Known Results

As a positive control, we first sought evidence for associations that replicate findings from the literature. Replication of previously reported associations validates our PheWAS pipeline and data integrity. Thirty-nine of our 69 PheWAS associations (56.5%) have previously been described in the literature with the same direction of effect, and our results for these associations are presented in Appendix Table 2.2 as well as visualized in Figure 2.2. A proportion of the phenotypes could be harmonized such that we could explore the association result for the phenotype across both surveys, NHANES III and Continuous NHANES, which we refer to as Combined NHANES (listed as NHANES Combined in our result tables). A Combined NHANES result was not available for every phenotype, as not all phenotypes could be harmonized across both surveys even if they could be binned into phenotype classes across both surveys. Our result tables contain this Combined NHANES information when available.


Figure 2.2. Replicating results for PheWAS. This is a plot of SNP-phenotype associations observed in both NHANES III and Continuous NHANES with p-value <0.01, for SNPs with an allele frequency >0.01, and a sample size >200, for the same race-ethnicity, phenotype-class, and direction of effect. Plotted are results where the significant SNP-phenotype association matches a previously reported SNP-Phenotype association. The first column indicates the chromosome and location of the SNP. The second column indicates the SNP ID, the associated phenotype-class, the self-reported race-ethnicity (NHW = Non-Hispanic Whites, NHB = Non-Hispanic Blacks, or MA = Mexican Americans), and the coded-allele. The next column contains a colored box if association results were available for natural log transformed Continuous NHANES (Continuous NHANES ln+1), un-transformed Continuous NHANES phenotypes, NHANES III untransformed phenotypes (NHANES III), or transformed NHANES III phenotypes (NHANES III ln+1). The next column indicates the p-value for each association, and the triangle direction indicates whether the association had a positive (triangle pointed to the left) or negative direction of effect (triangle pointed to the right). The following columns indicate magnitude of the effect (beta), the coded allele frequency (CAF), and the sample size for the association.

Most of the SNPs within our study (51 out of 80) were chosen for genotyping based on published lipid trait genetic association studies (for example, 113–115), and of these, 19/23 lipid-associated SNPs were associated with lipid traits in this PheWAS. For example, total cholesterol levels and LDL cholesterol levels have been previously associated with the SNP rs646776 near CELSR2 in European-descent populations116–118. In this PheWAS, we observed a significant association between rs646776 (coded allele G) and total cholesterol levels in NHANES III (p=3.17×10⁻⁶, β=-7.66, n=2,224) and Continuous NHANES (p=9.15×10⁻⁷, β=-0.014, n=3,943) for non-Hispanic whites with the same direction of effect as the association previously reported for this SNP and LDL cholesterol levels. The association between rs646776 and total cholesterol remained significant in Combined NHANES (p=1.0×10⁻¹⁰, β=-0.029, n=6,389).

Related Associations

After determining results where the phenotype of our association matched that of the same SNP-phenotype association in the GWAS Catalog, we evaluated whether any of our phenotypes were closely related to previously published SNP-phenotype associations. There were a total of 9/69 (~13%) PheWAS results where the SNPs had been previously associated with lipid measurements not exactly matching the respective lipid measurements of our study (Appendix Table 2.3 and Figure 2.3). For example, the SNP rs515135 near APOB/KLHL29 has been previously reported to be associated with LDL cholesterol (LDL-C) levels in European-descent populations119,120. In this PheWAS, rs515135 (coded allele G) was associated with total cholesterol levels in non-Hispanic whites. For this SNP, the most significant results meeting our PheWAS replication criteria were p=0.0024, β=4.85, n=2,569 in NHANES III and p=1.06×10⁻⁵, β=0.026, n=3,959 in Continuous NHANES. This variant was also associated with total cholesterol levels in Combined NHANES (p=1.39×10⁻⁷, β=5.13, n=6,528).


Figure 2.3. Related results for PheWAS. This is a plot of SNP-phenotype associations observed in both NHANES III and Continuous NHANES with p-value <0.01, for SNPs with an allele frequency >0.01, and a sample size >200, for the same race-ethnicity, phenotype-class, and direction of effect. Plotted are results where the significant SNP-phenotype association is closely related to the phenotype of a previously reported SNP-Phenotype association. The first column indicates the chromosome and bp location of the SNP. The second column indicates the SNP ID, the associated phenotype-class, the self-reported race-ethnicity (NHW = Non-Hispanic Whites, NHB = Non-Hispanic Blacks, or MA = Mexican Americans), and the coded-allele. The next column contains a colored box if association results were available for natural log transformed Continuous NHANES phenotypes (Continuous NHANES ln+1), un-transformed Continuous NHANES phenotypes, NHANES III untransformed phenotypes (NHANES III), or transformed NHANES III phenotypes (NHANES III ln+1). The next column indicates the p-value for each association, and the triangle direction indicates whether the association had a positive (triangle pointed to the left) or negative direction of effect (triangle pointed to the right). The following columns indicate magnitude of the effect (beta), the coded allele frequency (CAF), and the sample size for the association.

Another example of a closely related association was for SNP rs7557067 near APOB, previously found to be associated with triglyceride levels in European-descent populations120. In this PheWAS, rs7557067 (coded allele G) was associated with total cholesterol levels in non-Hispanic whites from NHANES III (p=0.0050, β=-0.012, n=2,436) and Continuous NHANES (p=0.0053, β=-0.015, n=3,966). In the larger sample size of Combined NHANES, this association with total cholesterol levels was maintained (p=1.1×10⁻⁴, β=-0.014, n=6,404). Given that total cholesterol includes HDL-C and that HDL-C is inversely correlated with triglycerides121,122, this PheWAS finding was also expected.

Novel Associations

The remainder of the PheWAS results had phenotypes that did not match, and were very distinct from, previously reported SNP-phenotype associations. A total of 21/69 (~30%) PheWAS results are potentially novel findings. These are associations with a greater divergence between the previously associated phenotype for a given SNP and the associated phenotype found in this study (Table 2.3). We found novel results for all three racial/ethnic groups. However, only one novel result meeting our PheWAS significance criteria generalized across two or more populations showing the same direction of effect: protoporphyrin levels in both non-Hispanic whites and Mexican Americans for the ABCG2 SNP rs2231142 (coded allele C). Of the replicating measures for protoporphyrin levels, the most significant results for this association in Mexican Americans were p=2.61×10⁻⁷, β=-0.075, n=2,029 for NHANES III; p=2.0×10⁻⁴, β=-0.079, n=968 for Continuous NHANES; and p=9.41×10⁻⁸, β=-5.21, n=3,897 for Combined NHANES. The most significant results for this association in non-Hispanic whites were p=6.0×10⁻⁶, β=-0.062, n=2,587 for NHANES III and p=6.6×10⁻⁴, β=-0.06, n=1,667 for Continuous NHANES. This SNP was previously associated with uric acid123–126. We also found this SNP to be associated with uric acid in non-Hispanic whites and Mexican Americans with the same direction of effect as previously reported associations, as well as an additional novel result for blood pressure measurements only in Mexican Americans with an opposite direction of effect. The number of novel results was similar across race-ethnicities, even with the difference in sample size across non-Hispanic whites, non-Hispanic blacks, and Mexican Americans that could affect power for detection of novel associations.

An example of a novel result showing a marked divergence from previously reported associations was for the SNP rs11206510 (coded allele T) near the gene PCSK9. This SNP has been previously associated with coronary heart disease127, LDL-C119,120,128, and myocardial infarction129 in European-descent populations, but we did not replicate any of those previously reported associations; this may be because of the smaller sample sizes available or differences in phenotype collection between NHANES and the other studies. In this study we found this SNP was associated with serum globulin levels in Mexican Americans from NHANES III (p=0.0095, β=0.012, n=2,023), Continuous NHANES (p=0.0042, β=0.012, n=1,871), and Combined NHANES (p=8.7×10⁻⁴, β=0.015, n=3,894). We compared the direction of effect for this SNP with that of the previously reported associations, and the direction of effect was the same.

Another example of novel divergence from previously reported results involved two SNPs we found to be associated with white blood cell count in non-Hispanic blacks. The SNP rs1800795 (coded allele G) near IL6 previously was associated with C-reactive protein levels130–132. In our study, this SNP was associated with white blood cell counts in non-Hispanic blacks from NHANES III (p=0.0047, β=-0.34, n=2,038) and Continuous NHANES (p=0.0048, β=-0.071, n=1,316). We also found that rs4355801 in TNFRSF11B was associated with white blood cell counts in non-Hispanic blacks from NHANES III (p=0.0036, β=0.30, n=2,077), Continuous NHANES (p=0.0079, β=0.378, n=1,334), and Combined NHANES (p=5.77×10⁻⁵, β=0.042, n=3,411). Previously, TNFRSF11B rs4355801 (coded allele G) was associated with bone mineral density in women of European descent133. We did not observe a significant PheWAS association with C-reactive protein or bone mineral density in our study for these two SNPs, respectively.

We found a total of six novel PheWAS-significant results associated with circulating vitamin levels (vitamin E, vitamin A, and folate). For example, a PheWAS-significant association for the missense SNP rs1260326 (coded allele T) in the gene GCKR was found with vitamin A levels in non-Hispanic whites from NHANES III (p=6.1×10⁻³, β=1.30, n=2,550), Continuous NHANES (p=1.11×10⁻⁴, β=2.34, n=1,639), and Combined NHANES (p=1.06×10⁻⁵, β=1.65, n=4,189). This SNP was previously associated with serum albumin levels and serum total protein levels in European- and Japanese-descent individuals134, non-albumin protein levels in Japanese-descent individuals135, platelet counts136, cardiovascular disease risk factors137, C-reactive protein levels138, urate levels123, total cholesterol and triglyceride levels139, and chronic kidney disease140 in individuals of European ancestry, and liver enzyme levels in European- and Asian-descent populations141. None of these previously reported associations replicated in our study. We compared the positive direction of effect of this SNP rs1260326, associated with vitamin levels, with previously reported associations. Associations of the same coded allele (T) with urate levels123, serum albumin levels134, serum total protein levels134, platelet counts136, liver enzyme levels141, cardiovascular disease risk factors137, C-reactive protein levels138, total cholesterol and triglyceride levels139, and chronic kidney disease140 all had a positive direction of effect. This allele was associated with non-albumin protein levels135 with a negative direction of effect.

Table 2.3. Novel Results. Novel PheWAS results with the same SNP, phenotype class, and race-ethnicity across NHANES, ordered by these variables. Information on the closest gene and previously reported phenotypes from the PAGE network and GWAS Catalog (with PubMed ID) are included for each SNP. Long phenotype description, along with the corresponding p-value, beta (SE = standard error), coded allele frequency (CAF), and sample size for the association are also listed. Significant measures from NHANES III and Continuous NHANES are included, and NHANES Combined was also included when the phenotype was harmonized across both surveys.

Each association below lists the SNP-level columns first, followed by one row per measure in the form: Study | Phenotype Description | P-value | Beta (SE) | Sample Size

Closest Gene(s): BSND - PCSK9 | SNP: rs11206510 | CHR: 1 | Position: 55496039 | Coded Allele: T
Previously Reported Phenotype (PAGE): LDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): Coronary heart disease, Myocardial infarction (early onset), LDL cholesterol
PubMed ID: 20864672, 19060906, 21378990, 19198609, 18193043
Phenotype Class: Globulin | Race/Ethnicity: MA
NHANES III | (ln + 1) Serum globulin (g/dL) | 0.0095 | 0.012 (0.0046) | 2023
NHANES III | Serum globulin (g/dL) | 0.0099 | 0.052 (0.020) | 2023
NHANES III | (ln + 1) Serum globulin: SI (g/L) | 0.0097 | 0.015 (0.0058) | 2023
NHANES III | Serum globulin: SI (g/L) | 0.0099 | 0.523 (0.20) | 2023
NHANES III | (ln + 1) Serumglobulin(g/dL) | 0.0095 | 0.012 (0.0046) | 2023
NHANES III | Serumglobulin(g/dL) | 0.0099 | 0.052 (0.020) | 2023
NHANES III | (ln + 1) Serumglobulin:SI(g/L) | 0.0097 | 0.015 (0.00582) | 2023
NHANES III | Serumglobulin:SI(g/L) | 0.0099 | 0.52 (0.20) | 2023
NHANES 1999-2002 | (ln + 1) Globulin (g/L) | 0.0042 | 0.012 (0.0062) | 1871
NHANES 1999-2002 | Globulin (g/L) | 0.0062 | 0.56 (0.20) | 1871
NHANES 1999-2002 | (ln + 1) Globulin (g/dL) | 0.0047 | 0.014 (0.0049) | 1871
NHANES 1999-2002 | Globulin (g/dL) | 0.0062 | 0.056 (0.020) | 1871
NHANES Combined | (ln + 1) Serum globulin (g/dL) | 0.00092 | 0.011 (0.0034) | 3894
NHANES Combined | Serum globulin (g/dL) | 0.0012 | 0.048 (0.015) | 3894
NHANES Combined | (ln + 1) Serum globulin: SI (g/L) | 0.00087 | 0.015 (0.0044) | 3894
NHANES Combined | Serum globulin: SI (g/L) | 0.0012 | 0.48 (0.15) | 3894
NHANES Combined | (ln + 1) Serumglobulin(g/dL) | 0.00092 | 0.011 (0.0034) | 3894
NHANES Combined | Serumglobulin(g/dL) | 0.0012 | 0.048 (0.014) | 3894
NHANES Combined | (ln + 1) Serumglobulin:SI(g/L) | 0.00087 | 0.015 (0.0044) | 3894
NHANES Combined | Serumglobulin:SI(g/L) | 0.0012 | 0.48 (0.15) | 3894

Closest Gene(s): GCKR | SNP: rs1260326 | CHR: 2 | Position: 27730940 | Coded Allele: T
Previously Reported Phenotype (PAGE): Triglycerides, lipids
Previously Reported Phenotype (GWAS Catalog): Cholesterol, total, Hypertriglyceridemia, Metabolite levels, Hematological and biochemical traits, Liver enzyme levels (gamma-glutamyl transferase), C-reactive protein, Waist circumference and related phenotypes, Triglycerides, Chronic kidney disease, Platelet counts, Two-hour glucose challenge, Cardiovascular disease risk factors, Metabolic traits
PubMed ID: 20657596, 22286219, 20139978, 22001757, 19060906, 21300955, 18454146, 20686565, 20383146, 22139419, 20081857, 21943158, 19060910
Phenotype Class: Vitamin A | Race/Ethnicity: NHW
NHANES III | Vitamin A (ug/dL) | 0.0061 | 1.30 (0.47) | 2550
NHANES III | Vitamin A (umol/L) | 0.0061 | 0.045 (0.017) | 2550
NHANES III | SerumvitaminA(ug/dL) | 0.0061 | 1.30 (0.47) | 2550
NHANES III | SerumvitaminA:SI(umol/L) | 0.0061 | 0.045 (0.017) | 2550
NHANES 1999-2002 | (ln + 1) Vitamin A (umol/L) | 0.00028 | 0.024 (0.0066) | 1639
NHANES 1999-2002 | Vitamin A (umol/L) | 0.00012 | 0.082 (0.021) | 1639
NHANES 1999-2002 | (ln + 1) Vitamin A (ug/dL) | 0.00045 | 0.035 (0.0099) | 1639
NHANES 1999-2002 | Vitamin A (ug/dL) | 0.00011 | 2.34 (0.61) | 1639
NHANES Combined | (ln + 1) Vitamin A (ug/dL) | 6.37e-05 | 0.024 (0.0060) | 4189
NHANES Combined | Vitamin A (ug/dL) | 1.06e-05 | 1.65 (0.37) | 4189
NHANES Combined | (ln + 1) Vitamin A (umol/L) | 3.49e-05 | 0.016 (0.0040) | 4189
NHANES Combined | Vitamin A (umol/L) | 1.08e-05 | 0.057 (0.013) | 4189
NHANES Combined | (ln + 1) SerumvitaminA(ug/dL) | 6.37e-05 | 0.024 (0.0060) | 4189
NHANES Combined | SerumvitaminA(ug/dL) | 1.06e-05 | 1.65 (0.37) | 4189
NHANES Combined | (ln + 1) SerumvitaminA:SI(umol/L) | 3.49e-05 | 0.016 (0.0040) | 4189
NHANES Combined | SerumvitaminA:SI(umol/L) | 1.08e-05 | 0.057 (0.013) | 4189

Closest Gene(s): SLC30A8 | SNP: rs13266634 | CHR: 8 | Position: 118184783 | Coded Allele: T
Previously Reported Phenotype (PAGE): Type 2 Diabetes
Previously Reported Phenotype (GWAS Catalog): Type 2 diabetes and other traits, Glycated hemoglobin levels
PubMed ID: 17460697, 17293876, 19401414, 19056611, 17463248, 17463249, 19734900, 19096518, 17463246
Phenotype Class: Vitamin E | Race/Ethnicity: NHW
NHANES III | (ln + 1) Vitamin E (ug/dL) | 0.0086 | -0.031 (0.012) | 2595
NHANES III | Vitamin E (ug/dL) | 0.0050 | -50.23 (17.86) | 2595
NHANES III | Vitamin E (umol/L) | 0.0050 | -1.16 (0.41) | 2595
NHANES III | (ln + 1) SerumvitaminE(ug/dL) | 0.0086 | -0.031 (0.012) | 2595
NHANES III | SerumvitaminE(ug/dL) | 0.0050 | -50.23 (17.86) | 2595
NHANES 1999-2002 | (ln + 1) Vitamin E (umol/L) | 0.0057 | -0.041 (0.015) | 1627
NHANES 1999-2002 | Vitamin E (umol/L) | 0.0037 | -1.70 (0.58) | 1627
NHANES 1999-2002 | (ln + 1) Vitamin E (ug/dL) | 0.0058 | -0.042 (0.015) | 1627
NHANES 1999-2002 | Vitamin E (ug/dL) | 0.0037 | -73.28 (25.17) | 1627
NHANES Combined | (ln + 1) Vitamin E (ug/dL) | 0.00025 | -0.034 (0.0095) | 4222
NHANES Combined | Vitamin E (ug/dL) | 8.54e-05 | -58.39 (14.84) | 4222
NHANES Combined | (ln + 1) Vitamin E (umol/L) | 0.00024 | -0.033 (0.0092) | 4222
NHANES Combined | Vitamin E (umol/L) | 8.54e-05 | -1.35 (0.34) | 4222
NHANES Combined | (ln + 1) SerumvitaminE(ug/dL) | 0.00025 | -0.034 (0.0095) | 4222
NHANES Combined | SerumvitaminE(ug/dL) | 8.54e-05 | -58.39 (14.84) | 4222

Closest Gene(s): FADS1 | SNP: rs174547 | CHR: 11 | Position: 61570783 | Coded Allele: T
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): Resting heart rate, HDL cholesterol, Metabolic traits, Phospholipid levels (plasma), Metabolite levels, Triglycerides, Lipid metabolism phenotypes
PubMed ID: 20639392, 21886157, 21829377, 20037589, 19060906, 22286219
Phenotype Class: Ferritin | Race/Ethnicity: MA
NHANES III | (ln + 1) Ferritin (ng/mL) | 0.0050 | 0.11 (0.037) | 1826
NHANES III | (ln + 1) Ferritin (ug/L) | 0.0050 | 0.11 (0.037) | 1826
NHANES III | (ln + 1) Serumferritin(ng/mL) | 0.0050 | 0.11 (0.037) | 1826
NHANES III | (ln + 1) Serumferritin:SI(ug/L) | 0.0050 | 0.11 (0.037) | 1826
NHANES 1999-2002 | (ln + 1) Ferritin (ug/L) | 0.0089 | 0.10 (0.040) | 1861
NHANES 1999-2002 | (ln + 1) Ferritin (ng/mL) | 0.0089 | 0.10 (0.040) | 1861
NHANES Combined | Ferritin (ng/mL) | 0.0064 | 9.16 (3.35) | 3687
NHANES Combined | Ferritin (ug/L) | 0.0064 | 9.16 (3.35) | 3687
NHANES Combined | Serumferritin(ng/mL) | 0.0064 | 9.16 (3.35) | 3687
NHANES Combined | Serumferritin:SI(ug/L) | 0.0064 | 9.16 (3.35) | 3687

Closest Gene(s): FADS1 | SNP: rs174547 | CHR: 11 | Position: 61570783 | Coded Allele: T
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): Resting heart rate, HDL cholesterol, Metabolic traits, Phospholipid levels (plasma), Metabolite levels, Triglycerides, Lipid metabolism phenotypes
PubMed ID: 20639392, 21886157, 21829377, 20037589, 19060906, 22286219
Phenotype Class: Folate | Race/Ethnicity: NHB
NHANES III | (ln + 1) RBC folate (ng/mL) | 0.0013 | -0.079 (0.025) | 2028
NHANES III | RBC folate (ng/mL) | 0.00015 | -15.14 (3.98) | 2028
NHANES III | (ln + 1) RBC folate: SI (nmol/L) | 0.0013 | -0.080 (0.025) | 2028
NHANES III | RBC folate: SI (nmol/L) | 0.00015 | -34.31 (9.029) | 2028
NHANES III | (ln + 1) RBCfolate(ng/mL) | 0.0013 | -0.079 (0.025) | 2028
NHANES III | RBCfolate(ng/mL) | 0.00015 | -15.14 (3.98) | 2028
NHANES III | (ln + 1) RBCfolate:SI(nmol/L) | 0.0013 | -0.080 (0.025) | 2028
NHANES III | RBCfolate:SI(nmol/L) | 0.00015 | -34.31 (9.029) | 2028
NHANES 1999-2002 | Folate, serum (nmol/L) | 0.0013 | -4.54 (1.41) | 1330
NHANES 1999-2002 | Folate, RBC (nmol/L RBC) | 0.0020 | -56.51 (18.25) | 1329
NHANES 1999-2002 | Folate, serum (ng/mL) | 0.0013 | -2.0036 (0.62) | 1330
NHANES 1999-2002 | Folate, RBC (ng/mL RBC) | 0.0020 | -24.95 (8.059) | 1329
NHANES Combined | Serum folate (ng/mL) | 0.0087 | -0.82 (0.31) | 3366
NHANES Combined | Serum folate: SI (nmol/L) | 0.0087 | -1.85 (0.70) | 3366
NHANES Combined | RBC folate (ng/mL) | 0.00039 | -15.81 (4.45) | 3357
NHANES Combined | RBC folate: SI (nmol/L) | 0.00039 | -35.82 (10.080) | 3357
NHANES Combined | RBCfolate(ng/mL) | 0.00039 | -15.81 (4.45) | 3357
NHANES Combined | RBCfolate:SI(nmol/L) | 0.00039 | -35.82 (10.080) | 3357
NHANES Combined | Serumfolate(ng/mL) | 0.0087 | -0.82 (0.31) | 3366
NHANES Combined | Serumfolate:SI(nmol/L) | 0.0087 | -1.85 (0.70) | 3366

Closest Gene(s): LIPC | SNP: rs1800588 | CHR: 15 | Position: 58723675 | Coded Allele: T
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): HDL cholesterol
PubMed ID: 18193044
Phenotype Class: Folate | Race/Ethnicity: MA
NHANES III | Serum folate (ng/mL) | 0.0089 | -0.32 (0.12) | 1996
NHANES III | Serum folate: SI (nmol/L) | 0.0088 | -0.73 (0.28) | 1996
NHANES III | Serumfolate(ng/mL) | 0.0089 | -0.32 (0.12) | 1996
NHANES III | Serumfolate:SI(nmol/L) | 0.0088 | -0.73 (0.28) | 1996
NHANES 1999-2002 | (ln + 1) Folate, RBC (nmol/L RBC) | 0.00010 | -0.047 (0.012) | 1838
NHANES 1999-2002 | Folate, RBC (nmol/L RBC) | 0.00054 | -31.060 (8.96) | 1838
NHANES 1999-2002 | (ln + 1) Folate, RBC (ng/mL RBC) | 0.00010 | -0.047 (0.012) | 1838
NHANES 1999-2002 | Folate, RBC (ng/mL RBC) | 0.00054 | -13.71 (3.95) | 1838
NHANES Combined | (ln + 1) Serum folate (ng/mL) | 0.0074 | -0.035 (0.013) | 3837
NHANES Combined | Serum folate (ng/mL) | 0.0030 | -0.46 (0.15) | 3837
NHANES Combined | (ln + 1) Serum folate: SI (nmol/L) | 0.0083 | -0.037 (0.014) | 3837
NHANES Combined | Serum folate: SI (nmol/L) | 0.0030 | -1.058 (0.35) | 3837
NHANES Combined | (ln + 1) RBC folate (ng/mL) | 0.0020 | -0.032 (0.010) | 3815
NHANES Combined | RBC folate (ng/mL) | 0.00045 | -9.38 (2.67) | 3815
NHANES Combined | (ln + 1) RBC folate: SI (nmol/L) | 0.0020 | -0.032 (0.010) | 3815
NHANES Combined | RBC folate: SI (nmol/L) | 0.00045 | -21.26 (6.053) | 3815
NHANES Combined | (ln + 1) RBCfolate(ng/mL) | 0.0020 | -0.032 (0.010) | 3815
NHANES Combined | RBCfolate(ng/mL) | 0.00045 | -9.38 (2.67) | 3815
NHANES Combined | (ln + 1) RBCfolate:SI(nmol/L) | 0.0020 | -0.032 (0.010) | 3815
NHANES Combined | RBCfolate:SI(nmol/L) | 0.00045 | -21.26 (6.053) | 3815
NHANES Combined | (ln + 1) Serumfolate(ng/mL) | 0.0074 | -0.035 (0.013) | 3837
NHANES Combined | Serumfolate(ng/mL) | 0.0030 | -0.46 (0.15) | 3837
NHANES Combined | (ln + 1) Serumfolate:SI(nmol/L) | 0.0083 | -0.037 (0.014) | 3837
NHANES Combined | Serumfolate:SI(nmol/L) | 0.0030 | -1.058 (0.35) | 3837

Closest Gene(s): LIPC | SNP: rs1800588 | CHR: 15 | Position: 58723675 | Coded Allele: T
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): HDL cholesterol
PubMed ID: 18193044
Phenotype Class: Vitamin E | Race/Ethnicity: NHW
NHANES III | (ln + 1) Vitamin E (ug/dL) | 0.00059 | 0.044 (0.013) | 2553
NHANES III | Vitamin E (ug/dL) | 0.0035 | 56.21 (19.26) | 2553
NHANES III | Vitamin E (umol/L) | 0.0035 | 1.31 (0.45) | 2553
NHANES III | (ln + 1) SerumvitaminE(ug/dL) | 0.00059 | 0.044 (0.013) | 2553
NHANES III | SerumvitaminE(ug/dL) | 0.0035 | 56.21 (19.26) | 2553
NHANES 1999-2002 | (ln + 1) Vitamin E (umol/L) | 0.00065 | 0.060 (0.017) | 1609
NHANES 1999-2002 | Vitamin E (umol/L) | 0.0017 | 2.15 (0.69) | 1609
NHANES 1999-2002 | (ln + 1) Vitamin E (ug/dL) | 0.00063 | 0.062 (0.018) | 1609
NHANES 1999-2002 | Vitamin E (ug/dL) | 0.0017 | 92.75 (29.57) | 1609
NHANES Combined | (ln + 1) Vitamin E (ug/dL) | 2.32e-05 | 0.045 (0.010) | 4162
NHANES Combined | Vitamin E (ug/dL) | 0.00020 | 61.83 (16.58) | 4162
NHANES Combined | (ln + 1) Vitamin E (umol/L) | 2.37e-05 | 0.043 (0.010) | 4162
NHANES Combined | Vitamin E (umol/L) | 0.00020 | 1.43 (0.38) | 4162
NHANES Combined | (ln + 1) SerumvitaminE(ug/dL) | 2.32e-05 | 0.045 (0.010) | 4162
NHANES Combined | SerumvitaminE(ug/dL) | 0.00020 | 61.83 (16.58) | 4162

Closest Gene(s): ABCG2 | SNP: rs2231142 | CHR: 4 | Position: 89052323 | Coded Allele: C
Previously Reported Phenotype (PAGE): Gout
Previously Reported Phenotype (GWAS Catalog): Uric acid levels
PubMed ID: 22229870, 19503597, 18834626
Phenotype Class: Blood Pressure (Diastolic) | Race/Ethnicity: MA
NHANES III | (ln + 1) K5 for second BP measure (diastolic, mmHg) | 0.00030 | 0.044 (0.012) | 1697
NHANES III | K5 for second BP measure (diastolic, mmHg) | 0.00098 | 1.76 (0.53) | 1697
NHANES III | (ln + 1) K5 for third BP measure (diastolic, mmHg) | 0.0063 | 0.032 (0.012) | 1694
NHANES III | K5 for third BP measure (diastolic, mmHg) | 0.0046 | 1.51 (0.53) | 1694
NHANES III | (ln + 1) Overall average K5, diastolic BP(age5+) | 1.45e-06 | 0.033 (0.0068) | 2023
NHANES III | Overall average K5, diastolic, BP(age5+) | 2.21e-06 | 2.19 (0.46) | 2023
NHANES 1999-2002 | BPXDI3:Diastolic: Blood pres(3rd rdg) mm Hg | 0.0045 | 1.69 (0.59) | 1605
NHANES Combined | K5 for first BP measure (diastolic, mmHg) | 0.0088 | 1.092 (0.41) | 3338
NHANES Combined | (ln + 1) K5 for second BP measure (diastolic, mmHg) | 0.0033 | 0.032 (0.011) | 3340
NHANES Combined | K5 for second BP measure (diastolic, mmHg) | 2.89e-05 | 1.66 (0.39) | 3340
NHANES Combined | K5 for third BP measure (diastolic, mmHg) | 3.98e-05 | 1.65 (0.40) | 3299
NHANES Combined | Overall average K5, diastolic, BP(age5+) | 5.21e-07 | 1.82 (0.36) | 3837

Closest Gene(s): ABCG2 | SNP: rs2231142 | CHR: 4 | Position: 89052323 | Coded Allele: C
Previously Reported Phenotype (PAGE): Gout
Previously Reported Phenotype (GWAS Catalog): Urate levels
PubMed ID: 22229870, 19503597, 18834626
Phenotype Class: Protoporphyrin | Race/Ethnicity: MA
NHANES III | (ln + 1) Protoporphyrin (ug/dL RBC) | 2.61e-07 | -0.075 (0.015) | 2029
NHANES III | Protoporphyrin (ug/dL RBC) | 0.0040 | -3.92 (1.36) | 2029
NHANES III | (ln + 1) Protoporphyrin (umol/L RBC) | 7.87e-06 | -0.037 (0.0083) | 2029
NHANES III | Protoporphyrin (umol/L RBC) | 0.0040 | -0.070 (0.024) | 2029
NHANES 1999-2002 | (ln + 1) Protoporphyrin (umol/L RBC) | 0.00037 | -0.042 (0.012) | 968
NHANES 1999-2002 | Protoporphyrin (umol/L RBC) | 0.0018 | -0.094 (0.030) | 968
NHANES 1999-2002 | (ln + 1) Protoporphyrin (ug/dL RBC) | 0.00020 | -0.079 (0.021) | 968
NHANES 1999-2002 | Protoporphyrin (ug/dL RBC) | 0.0018 | -5.321 (1.70) | 968
NHANES Combined | Protoporphyrin (ug/dL RBC) | 9.41e-08 | 5.21 (0.97) | 3897
NHANES Combined | Protoporphyrin (umol/L RBC) | 9.85e-08 | 0.092 (0.017) | 3897

Closest Gene(s): ABCG2 | SNP: rs2231142 | CHR: 4 | Position: 89052323 | Coded Allele: C
Previously Reported Phenotype (PAGE): Gout
Previously Reported Phenotype (GWAS Catalog): Urate levels
PubMed ID: 22229870, 19503597, 18834626
Phenotype Class: Protoporphyrin | Race/Ethnicity: NHW
NHANES III | (ln + 1) Protoporphyrin (ug/dL RBC) | 6.00e-06 | -0.062 (0.014) | 2587
NHANES III | (ln + 1) Protoporphyrin (ug/dL RBC) | 6.00e-06 | -0.062 (0.014) | 2587
NHANES III | (ln + 1) Protoporphyrin (umol/L RBC) | 9.60e-06 | -0.032 (0.0073) | 2587
NHANES III | (ln + 1) Protoporphyrin (umol/L RBC) | 9.60e-06 | -0.032 (0.0073) | 2587
NHANES III | Protoporphyrin (umol/L RBC) | 3.70e-05 | -0.087 (0.021) | 2587
NHANES III | Protoporphyrin (umol/L RBC) | 3.60e-05 | -0.087 (0.021) | 2587
NHANES III | Protoporphyrin (ug/dL RBC) | 3.76e-05 | -4.88 (1.18) | 2587
NHANES III | Protoporphyrin (ug/dL RBC) | 3.76e-05 | -4.88 (1.18) | 2587
NHANES 9902 | (ln + 1) Protoporphyrin (ug/dL RBC) | 6.6e-04 | -0.060 (0.01) | 1667
NHANES 9902 | (ln + 1) Protoporphyrin (umol/L RBC) | 0.0012 | -0.029 (0.0092) | 1667
NHANES 9902 | Protoporphyrin (umol/L RBC) | 0.0064 | -0.058 (0.021) | 1667
NHANES 9902 | Protoporphyrin (ug/dL RBC) | 0.0065 | -3.28 (1.20) | 1667

Closest Gene(s): KCTD10 | SNP: rs2338104 | CHR: 12 | Position: 109895168 | Coded Allele: G
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): HDL cholesterol
PubMed ID: 19060906, 18193043
Phenotype Class: Hearing | Race/Ethnicity: NHW
NHANES III | Right Threshold @ 1000Hz-Second Reading | 0.0034 | 2.20 (0.74) | 258
NHANES III | Rightearairhearlvl, repeat, 1000Hz(dB) | 0.0034 | 2.20 (0.74) | 258
NHANES 1999-2002 | Right threshold @ 500Hz | 0.0060 | 1.11 (0.40) | 1415
NHANES Combined | Right threshold @ 1000Hz | 0.0071 | 1.010 (0.37) | 1669
NHANES Combined | Right Threshold @ 1000Hz-Second Reading | 0.0028 | 1.14 (0.381) | 1673
NHANES Combined | Right threshold @ 500Hz | 0.0023 | 1.12 (0.36) | 1673
NHANES Combined | Rightearairhearlevel, first, 1000Hz(dB) | 0.0071 | 1.010 (0.37) | 1669
NHANES Combined | Rightearairhearlvl, repeat,1000Hz(dB) | 0.0028 | 1.14 (0.38) | 1673
NHANES Combined | Rightearairhearinglevel, 500Hz(dB) | 0.0023 | 1.12 (0.36) | 1673

Closest Gene(s): KCTD10 | SNP: rs2338104 | CHR: 12 | Position: 109895168 | Coded Allele: G
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): HDL cholesterol
PubMed ID: 19060906, 18193043
Phenotype Class: Hemoglobin | Race/Ethnicity: NHW
NHANES III | (ln + 1) Mean cell hemoglobin: SI (pg) | 0.0016 | -0.0050 (0.0016) | 2582
NHANES III | Mean cell hemoglobin: SI (pg) | 0.0017 | -0.15 (0.048) | 2582
NHANES III | (ln + 1) Meancellhemoglobin:SI(pg) | 0.0016 | -0.0050 (0.0016) | 2582
NHANES III | Meancellhemoglobin:SI(pg) | 0.0017 | -0.15 (0.048) | 2582
NHANES 1999-2002 | (ln + 1) Mean Cell Hemoglobin Concentration (MCHC) (g/dL) | 0.0024 | -0.0015 (0.00048) | 3964
NHANES 1999-2002 | Mean Cell Hemoglobin Concentration (MCHC) (g/dL) | 0.0025 | -0.050 (0.017) | 3964
NHANES 1999-2002 | (ln + 1) Mean cell hemoglobin (pg) | 0.0018 | -0.0042 (0.0013) | 3964
NHANES 1999-2002 | Mean cell hemoglobin (pg) | 0.0028 | -0.12 (0.042) | 3964
NHANES Combined | (ln + 1) Mean cell hemoglobin: SI (pg) | 2.16e-05 | -0.0044 (0.0010) | 6546
NHANES Combined | Mean cell hemoglobin: SI (pg) | 3.85e-05 | -0.13 (0.031) | 6546
NHANES Combined | (ln + 1) Mean cell hemoglobin concentration | 0.0038 | -0.001 (0.00036) | 6546
NHANES Combined | Mean cell hemoglobin concentration | 0.0043 | -0.036 (0.012) | 6546
NHANES Combined | (ln + 1) Meancellhemoglobin:SI(pg) | 2.16e-05 | -0.0044 (0.0010) | 6546
NHANES Combined | Meancellhemoglobin:SI(pg) | 3.85e-05 | -0.13 (0.031) | 6546
NHANES Combined | (ln + 1) Meancellhemoglobinconcentration | 0.0038 | -0.001 (0.00036) | 6546
NHANES Combined | Meancellhemoglobinconcentration | 0.0043 | -0.036 (0.012) | 6546

Closest Gene(s): BUD13 | SNP: rs28927680 | CHR: 11 | Position: 116619073 | Coded Allele: G
Previously Reported Phenotype (PAGE): HDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): Triglycerides
PubMed ID: 18193044
Phenotype Class: Vitamin E | Race/Ethnicity: NHW
NHANES III | (ln + 1) Vitamin E (ug/dL) | 4.45e-05 | -0.086 (0.021) | 2596
NHANES III | Vitamin E (ug/dL) | 0.00063 | -109.024 (31.84) | 2596
NHANES III | Vitamin E (umol/L) | 0.00063 | -2.53 (0.74) | 2596
NHANES III | (ln + 1) SerumvitaminE(ug/dL) | 4.45e-05 | -0.086 (0.021) | 2596
NHANES III | SerumvitaminE(ug/dL) | 0.00063 | -109.024 (31.84) | 2596
NHANES 1999-2002 | (ln + 1) Vitamin E (umol/L) | 0.0010 | -0.090 (0.027) | 1624
NHANES 1999-2002 | Vitamin E (umol/L) | 0.0033 | -3.17 (1.077) | 1624
NHANES 1999-2002 | (ln + 1) Vitamin E (ug/dL) | 0.0010 | -0.093 (0.028) | 1624
NHANES 1999-2002 | Vitamin E (ug/dL) | 0.0033 | -136.39 (46.39) | 1624
NHANES Combined | (ln + 1) Vitamin E (ug/dL) | 1.34e-07 | -0.090 (0.017) | 4220
NHANES Combined | Vitamin E (ug/dL) | 5.36e-06 | -122.098 (26.79) | 4220
NHANES Combined | (ln + 1) Vitamin E (umol/L) | 1.4e-07 | -0.087 (0.016) | 4220
NHANES Combined | Vitamin E (umol/L) | 5.36e-06 | -2.83 (0.62) | 4220
NHANES Combined | (ln + 1) SerumvitaminE(ug/dL) | 1.34e-07 | -0.090 (0.017) | 4220
NHANES Combined | SerumvitaminE(ug/dL) | 5.36e-06 | -122.098 (26.79) | 4220

Closest Gene(s): RPS26P35 - TNFRSF11B | SNP: rs4355801 | CHR: 8 | Position: 119923873 | Coded Allele: G
Previously Reported Phenotype (PAGE): Bone mineral density
Previously Reported Phenotype (GWAS Catalog): Bone mineral density
PubMed ID: 18455228
Phenotype Class: White Blood Cell | Race/Ethnicity: NHB
NHANES III | White blood cell count: SI | 0.0036 | 0.30 (0.10) | 2077
NHANES III | Whitebloodcellcount | 0.0036 | 0.30 (0.10) | 2077
NHANES III | Whitebloodcellcount:SI | 0.0036 | 0.30 (0.10) | 2077
NHANES 1999-2002 | White blood cell count SI | 0.0079 | 0.378 (0.14) | 1334
NHANES Combined | (ln + 1) White blood cell count: SI | 5.77e-05 | 0.042 (0.010) | 3411
NHANES Combined | White blood cell count: SI | 7.19e-05 | 0.33 (0.083) | 3411
NHANES Combined | (ln + 1) Whitebloodcellcount:SI | 5.77e-05 | 0.042 (0.010) | 3411
NHANES Combined | Whitebloodcellcount:SI | 7.19e-05 | 0.33 (0.083) | 3411

Closest Gene(s): SLC2A9 | SNP: rs6855911 | CHR: 4 | Position: 9935910 | Coded Allele: G
Previously Reported Phenotype (PAGE): Uric acid
Previously Reported Phenotype (GWAS Catalog): Urate levels
PubMed ID: 17997608
Phenotype Class: Body Measurements (Leg) | Race/Ethnicity: NHB
NHANES III | Thigh circumference (cm)(2 yrs and over) | 0.0012 | 0.80 (0.25) | 1972
NHANES 1999-2002 | (ln + 1) Thigh Circumference (cm) | 0.0087 | 0.014 (0.0055) | 1256
NHANES 1999-2002 | Thigh Circumference (cm) | 0.0061 | 0.89 (0.33) | 1256
NHANES Combined | (ln + 1) Thigh circumference (cm)(2 yrs and over) | 1.09e-05 | 0.015 (0.0034) | 3228
NHANES Combined | Thigh circumference (cm)(2 yrs and over) | 6.12e-06 | 0.90 (0.19) | 3228

Closest Gene(s): APOB - KLHL29 | SNP: rs562338 | CHR: 2 | Position: 21288321 | Coded Allele: T
Previously Reported Phenotype (PAGE): LDL Cholesterol
Previously Reported Phenotype (GWAS Catalog): LDL Cholesterol
PubMed ID: 18262040, 18193040
Phenotype Class: Hearing | Race/Ethnicity: NHB
NHANES III | (ln + 1) Leftearairhearlvl,repeat,1000Hz(dB) | 0.0071 | 0.20 (0.075) | 377
NHANES 9902 | Left threshold @ 1000Hz (db) | 0.0096 | 1.99 (0.77) | 528

Closest Gene(s): GCKR | SNP: rs780094 | CHR: 2 | Position: 27741237 | Coded Allele: G
Previously Reported Phenotype (PAGE): Metabolic syndrome, Phospholipid levels (plasma), Triglycerides, Fasting glucose-related traits, LDL cholesterol, Fasting insulin-related traits, Uric acid levels, Metabolic traits, C-reactive protein
Previously Reported Phenotype (GWAS Catalog): C-reactive Protein, T2D; glucose/HOMA-B
PubMed ID: 22581228, 22399527, 21886157, 21829377, 20081858, 19503597, 18439548, 18193044, 18193043, 18179892
Phenotype Class: Potassium intake | Race/Ethnicity: MA
NHANES III | Potassium(mg) | 0.0043 | -132.98 (0.66) | 1961
NHANES III | (ln + 1) Potassium(mg) | 0.0053 | -0.050 (0.66) | 1961
NHANES 9902 | Potassium (mg) | 0.0047 | -139.47 (0.67) | 1798

Closest Gene(s): GCKR | SNP: rs780094 | CHR: 2 | Position: 27741237 | Coded Allele: G
Previously Reported Phenotype (PAGE): Metabolic syndrome, Phospholipid levels (plasma), Triglycerides, Fasting glucose-related traits, LDL cholesterol, Fasting insulin-related traits, Uric acid levels, Metabolic traits, C-reactive protein
Previously Reported Phenotype (GWAS Catalog): C-reactive Protein, T2D; glucose/HOMA-B
PubMed ID: 22581228, 22399527, 21886157, 21829377, 20081858, 19503597, 18439548, 18193044, 18193043, 18179892
Phenotype Class: Vitamin B6 intake | Race/Ethnicity: MA
NHANES III | VitaminB6(mg) | 0.0098 | -0.10 (0.66) | 1961
NHANES 9902 | Vitamin B6 (mg) | 0.0042 | -0.11274 (0.67) | 1798

Closest Gene(s): None | SNP: rs1800795 | CHR: 7 | Position: 22766645 | Coded Allele: G
Previously Reported Phenotype (PAGE): C-reactive Protein
Previously Reported Phenotype (GWAS Catalog): N/A
PubMed ID: 15820616, 16544245, 16644865, 17003362, 17416766, 17623760, 17694420, 17916900, 17996468, 18041006, 18239642, 18257935, 18276608, 18321738, 18449426, 18458677, 18752089, 18992263, 19056105, 19106168, 19140096, 19267250, 19280716, 19330901, 19377912, 19387461, 19435922, 19452524, 19542902, 19592000, 19671870, 19833146, 19853505, 19876004, 20043205, 20044998, 20132806, 20149313, 20175976, 20176930, 20333461, 20361391, 20436380, 20459474, 20592333, 20622166
Phenotype Class: White Blood Cell Count | Race/Ethnicity: NHB
NHANES III | White blood cell count: SI | 0.0047 | -0.34 (0.12) | 2038
NHANES 9902 | (ln + 1) Segmented Neutrophils number | 0.0048 | -0.071 (0.025) | 1316
NHANES 9902 | (ln + 1) White blood cell count SI | 0.0071 | -0.054 (0.020) | 1324
NHANES 9902 | Segmented Neutrophils number | 0.0083 | -0.325 (0.12) | 1316
NHANES Combined | (ln + 1) White blood cell count: SI | 7.01E-05 | -0.047 (0.011) | 3362
NHANES Combined | White blood cell count: SI | 0.00016 | -0.36 (0.096) | 3362

Closest Gene(s): KCNQ1 | SNP: rs2237895 | CHR: 11 | Position: 2857194 | Coded Allele: C
Previously Reported Phenotype (PAGE): Type 2 Diabetes
Previously Reported Phenotype (GWAS Catalog): Type 2 Diabetes
PubMed ID: 20174558
Phenotype Class: Body Measurement (Arm) | Race/Ethnicity: MA
NHANES III | Arm circumference(cm)(2 months and over) | 0.0019 | -0.47 (0.15) | 1987
NHANES III | (ln + 1) Arm circumference(cm)(2 months and over) | 0.0020 | -0.015 (0.0047) | 1987
NHANES 9902 | (ln + 1) Upper Arm Length (cm) | 0.0061 | -0.0062 (0.0022) | 1835
NHANES 9902 | Upper Arm Length (cm) | 0.0068 | -0.23 (0.084) | 1835

Closest Gene(s): None | SNP: rs1529729 | CHR: 19 | Position: 11163562 | Coded Allele: G
Previously Reported Phenotype (PAGE): LDL-C
Previously Reported Phenotype (GWAS Catalog): None
PubMed ID: 18714375
Phenotype Class: Broken/Fractured Bone | Race/Ethnicity: NHW
NHANES III | Doctor told had broken/fractured spine | 0.0042 | 0.493 (0.25) | 2303
NHANES 9902 | (ln + 1) Broken or fractured spine | 0.002 | 1.54 (0.14) | 3933
NHANES 9902 | Broken or fractured spine | 0.002 | 1.54 (0.14) | 3933

Identification of Pleiotropy

While any of the novel PheWAS associations indicates potential pleiotropy, because all of the SNPs in this study have previously reported genome-wide associations, we found 13 SNPs with more than one significant PheWAS phenotype class within our study (Table 2.4 and Figure 2.4). While the majority of these SNPs were associated with more than one lipid phenotype, nine SNPs were associated with other phenotypes.

Figure 2.4. Potentially pleiotropic results. These are the PheWAS-significant results of this study with more than one distinct phenotype-class associated with the same SNP. This is a plot of SNP-phenotype associations observed in both NHANES III and Continuous NHANES with p-value <0.01, for SNPs with an allele frequency >0.01, and a sample size >200, for the same race-ethnicity, phenotype-class, and direction of effect. Plotted are results where the significant SNP-phenotype association matches a previously reported SNP-phenotype association. The first column indicates the chromosome and bp location of the SNP. The second column indicates the SNP ID, the associated phenotype-class, the self-reported race-ethnicity (NHW = Non-Hispanic Whites, NHB = Non-Hispanic Blacks, or MA = Mexican Americans), and the coded allele. The next column contains a colored box if association results were available for natural log transformed NHANES III phenotypes (NHANES III ln+1), untransformed NHANES III phenotypes (NHANES III), natural log transformed Continuous NHANES phenotypes (Continuous NHANES ln+1), or untransformed Continuous NHANES phenotypes. The next column indicates the p-value for each association, and the triangle direction indicates whether the association had a positive (triangle pointed to the left) or negative direction of effect (triangle pointed to the right). The following columns indicate the magnitude of the effect (beta), the coded allele frequency (CAF), and the sample size for the association.
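The replication criteria listed in the Figure 2.4 caption can be expressed as a simple filter. The following is a minimal sketch, not the pipeline used in this study: the record fields (`snp`, `race`, `pheno_class`, `survey`, `p`, `beta`, `caf`, `n`) and the example rows are hypothetical, and only the three thresholds come from the text.

```python
from collections import defaultdict

# Thresholds stated in the caption; all field names below are assumptions.
P_MAX, CAF_MIN, N_MIN = 0.01, 0.01, 200


def passes(result):
    """Per-survey criteria: p < 0.01, allele frequency > 0.01, n > 200."""
    return (result["p"] < P_MAX
            and result["caf"] > CAF_MIN
            and result["n"] > N_MIN)


def replicated(results):
    """Return (snp, race, pheno_class) keys that pass in both surveys
    with at least one shared direction of effect."""
    directions = defaultdict(lambda: defaultdict(set))
    for r in results:
        if passes(r):
            key = (r["snp"], r["race"], r["pheno_class"])
            directions[key][r["survey"]].add(r["beta"] > 0)
    keep = []
    for key, by_survey in directions.items():
        first = by_survey.get("NHANES III", set())
        second = by_survey.get("Continuous NHANES", set())
        if first & second:  # same sign seen in both surveys
            keep.append(key)
    return sorted(keep)


# Hypothetical rows for illustration only (not study results).
example_rows = [
    {"snp": "rsA", "race": "NHB", "pheno_class": "White blood cell count",
     "survey": "NHANES III", "p": 0.004, "beta": 0.30, "caf": 0.30, "n": 2000},
    {"snp": "rsA", "race": "NHB", "pheno_class": "White blood cell count",
     "survey": "Continuous NHANES", "p": 0.008, "beta": 0.40, "caf": 0.30, "n": 1300},
    {"snp": "rsB", "race": "NHW", "pheno_class": "Vitamin E",
     "survey": "NHANES III", "p": 0.002, "beta": -0.10, "caf": 0.20, "n": 2500},
    {"snp": "rsB", "race": "NHW", "pheno_class": "Vitamin E",
     "survey": "Continuous NHANES", "p": 0.003, "beta": 0.10, "caf": 0.20, "n": 1600},
]
```

In this toy input, `rsA` is retained because both surveys agree on the sign of the effect, while `rsB` is dropped because the direction flips between surveys.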

Table 2.4. Pleiotropic results.

For example, the missense SNP rs2231142 in ABCG2, also described in the novel results, had two novel associations, with protoporphyrin (in non-Hispanic whites and Mexican Americans) and blood pressure levels (in Mexican Americans), and one replication of a previously known association with uric acid levels (in non-Hispanic whites and Mexican Americans). The results for this SNP are plotted in Figure 2.5.

Figure 2.5. Sun plot of (p<0.01) results for ABCG2 rs2231142, coded allele C. This SNP has been previously reported to be associated with uric acid levels. Significant SNP-phenotype associations (p<0.01) are plotted clockwise with the smallest p-value result at the top. The length of each line corresponds to the –log(p-value) of each result, with the longest line representing the most significant result for this SNP, meeting our PheWAS replication criteria for inclusion. Study, transformed (LN +1) or untransformed (none) phenotype description, self-reported race-ethnicity, and direction of effect are listed for each association. This SNP was associated with a number of phenotypes in this study including uric acid levels (as previously published) in non-Hispanic whites (NHW) and Mexican Americans (MA), protoporphyrin levels in non-Hispanic whites and Mexican Americans, and diastolic blood pressure in Mexican Americans.

For another example, rs2338104, an intronic SNP in KCTD10, which was previously associated with HDL cholesterol (HDL-C) in European-descent populations120,128, was associated here with hemoglobin and hearing levels, both novel results in non-Hispanic whites (Figure 2.6).

Another example of potential pleiotropy was SNP rs1800588 near LIPC, previously associated with HDL-C in European-descent populations118. We observed significant associations between this SNP and the novel phenotypes of folate (in Mexican Americans) and vitamin E levels (in non-Hispanic whites), as well as replication for cholesterol and the related phenotype of triglycerides (both in non-Hispanic whites; Figure 2.7). The intronic SNP rs174547 of FADS1 provides another example. This SNP was previously associated with phospholipid levels142, resting heart rate143, phosphatidylcholine levels144, and HDL-C and triglyceride levels120 in individuals of European ancestry. Here, this SNP is associated with ferritin levels in Mexican Americans and with folate levels in non-Hispanic blacks.

Figure 2.6. Sun plot of (p<0.01) results for KCTD10 rs2338104, coded allele G. This SNP was previously associated with HDL-C levels. Significant SNP-phenotype associations (p<0.01) for this study are plotted clockwise with the smallest p-value result at the top. The length of each line corresponds to the –log(p-value) of each result, with the longest line representing the most significant result for this SNP, meeting our PheWAS replication criteria for inclusion. Study, transformed (LN +1) or untransformed (none) phenotype description, self-reported race-ethnicity, and direction of effect are listed for each association. This SNP was associated with mean cell hemoglobin levels, as well as right ear hearing levels in non-Hispanic whites (NHW).


Figure 2.7. Sun plot of (p<0.01) results for LIPC rs1800588, coded allele T. This SNP was previously associated with HDL-C in European-descent populations. Significant associations (p<0.01) are plotted clockwise with the most significant result at the top. The length of each line corresponds to the –log(p-value) of each result, with the longest line representing the most significant result for this SNP, meeting our PheWAS replication criteria for inclusion. Study, transformed (LN +1) or untransformed (none) phenotype description, self-reported race-ethnicity, and direction of effect are listed for each association. This SNP was associated with a number of phenotypes including folate in Mexican Americans (MA), total cholesterol in non-Hispanic whites (NHW), triglyceride levels in non-Hispanic whites, and vitamin E levels in non-Hispanic whites.

To further characterize these putative pleiotropic relationships, we compared the direction of effect for each association (Table 2.4). We found variants related to potentially protective effects for certain traits and potential risk effects for other traits. For example, intergenic SNP rs12678919 near LPL was associated with HDL cholesterol levels in non-Hispanic whites with a positive direction of effect and with hearing in non-Hispanic blacks with a negative direction of effect (coded allele G). Intronic SNP rs174547 in FADS1 was associated with ferritin levels in Mexican Americans with a positive direction of effect and with folate (in non-Hispanic blacks) and triglycerides (in non-Hispanic whites) with a negative direction of effect (coded allele T). The intronic SNP rs6855911 in SLC2A9 was associated with uric acid (in both non-Hispanic blacks and Mexican Americans) with a negative direction of effect and with thigh circumference measurements (in non-Hispanic blacks) with a positive direction of effect (coded allele G).
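The comparison described above, finding SNPs associated with more than one phenotype class and checking whether the directions of effect disagree, can be sketched as follows. This is an illustrative sketch rather than the analysis code used here; the tuple layout and function name are assumptions, and the directions ("+" or "-") are transcribed from the three examples in the preceding paragraph.

```python
from collections import defaultdict

# (SNP, phenotype class, race-ethnicity, direction of effect), as described
# in the text for the three examples above.
associations = [
    ("rs12678919", "HDL cholesterol", "NHW", "+"),
    ("rs12678919", "hearing", "NHB", "-"),
    ("rs174547", "ferritin", "MA", "+"),
    ("rs174547", "folate", "NHB", "-"),
    ("rs174547", "triglycerides", "NHW", "-"),
    ("rs6855911", "uric acid", "NHB", "-"),
    ("rs6855911", "uric acid", "MA", "-"),
    ("rs6855911", "thigh circumference", "NHB", "+"),
]


def contrasting_snps(rows):
    """Map each SNP that has more than one phenotype class and both
    directions of effect to its sorted list of phenotype classes."""
    by_snp = defaultdict(dict)
    for snp, pheno_class, _race, direction in rows:
        by_snp[snp].setdefault(pheno_class, set()).add(direction)
    flagged = {}
    for snp, classes in by_snp.items():
        directions = set().union(*classes.values())
        if len(classes) > 1 and directions == {"+", "-"}:
            flagged[snp] = sorted(classes)
    return flagged
```

A SNP associated with several phenotype classes in the same direction would still count as potentially pleiotropic; this helper only isolates the contrasting cases highlighted in the paragraph.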

Investigating Interrelationships within PheWAS Results

PheWAS-significant results provide an opportunity to explore the relationships between SNPs, genes, traits/outcomes, and pathways or other known relationships between genes and gene products. We used the software tool Biofilter to identify the genes that the PheWAS-significant SNPs were within or closest to. We then used Biofilter to annotate the resultant genes using the Kyoto Encyclopedia of Genes and Genomes (KEGG)108, Gene Ontology (GO)109, and NetPath110, which allowed us to identify any known connections between genes due to shared biological pathways or other known biological connections. After stratifying the results by race-ethnicity, we used Cytoscape111 to visualize the connections between genes based on their annotation. We present here the networks in which two or more PheWAS-significant SNPs mapped to genes, and two or more of those genes were connected by a pathway or other gene-gene connection.

Figure 2.8 shows one example from the PheWAS results in Mexican Americans, where LPL SNP rs328 had a significant association with HDL-C levels, and the FADS1 SNP rs174547 had an association with ferritin levels. Both genes are found in the TGF-β receptor regulated NetPath pathway. Figure 2.9 shows another example in Mexican Americans in which three SNPs were associated with uric acid levels: rs2231142, rs7442295, and rs6855911. One of the SNPs is located within the gene ABCG2, and the other two SNPs are located within SLC2A9 (blue boxes). Both ABCG2 and SLC2A9 are found within the GO biological process “urate metabolic process”, a collection of the gene products involved in the chemical reactions and pathways involving urate. These same connections were also found for non-Hispanic whites, as this group had a PheWAS-significant association between these SNPs and uric acid levels. One of the SNPs, rs2231142, was also associated with diastolic blood pressure and protoporphyrin levels.

Figure 2.8. Using PheWAS results, Biofilter, and Cytoscape to explore gene-gene connections with NetPath. We used Biofilter to annotate the SNPs of this PheWAS with gene information. We then mapped the genes to concomitant pathways or other gene groupings through GO, KEGG, and NetPath. This is one example from the results for Mexican Americans and annotation with NetPath. The pink diamonds are associated phenotypes of this PheWAS, the green hexagons are SNPs, blue boxes are genes, and circles are biological connections that link genes together; in this case, the two genes are in the same TGF NetPath biological pathway. Thus, we see that in the PheWAS results, the LPL SNP rs328 had a significant association with HDL cholesterol levels, FADS1 rs174547 had an association with ferritin levels, and both genes are found in the TGF beta receptor pathway.


Figure 2.9. Using PheWAS results, Biofilter, and Cytoscape to explore gene-gene connections with GO biological processes. Three SNPs were associated with uric acid levels in Mexican Americans: rs2231142, rs7442295, and rs6855911 (green hexagons). One of the SNPs is within the gene ABCG2, and the other two SNPs are within SLC2A9 (blue boxes). Both ABCG2 and SLC2A9 are found within the GO biological process “urate metabolic process”, a collection of the gene products involved in the chemical reactions and pathways involving urate. This was also found for non-Hispanic whites.

Figure 2.10 displays an example using KEGG and the Mexican American PheWAS results. LPL and LIPC are both involved in the KEGG biological process “glycerolipid metabolism”. LPL SNP rs328 was associated in this study with HDL-C, while LIPC SNP rs1800588 was associated with folate levels. LPL was also involved in the KEGG pathway “Peroxisome Proliferator-Activated Receptor (PPAR) signaling pathway”, along with APOA5, which was associated with triglyceride levels through its SNP rs3135506. PPARs are transcription factors activated by lipids.

Figure 2.10. Using PheWAS results, Biofilter, and Cytoscape to explore gene-gene connections with KEGG connections. The LIPC SNP rs1800588 was associated with folate levels, the LPL SNP rs328 was associated with HDL cholesterol, and both of these genes are in the glycerolipid metabolism KEGG pathway in Mexican Americans. The APOA5 SNP rs3135506, associated with triglyceride levels in our study, shares the PPAR signaling pathway along with LPL.
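The annotate-then-link step described in this section can be illustrated with a small sketch. It does not call Biofilter or Cytoscape; it only shows the selection rule (keep an annotation group when it connects two or more genes that carry PheWAS-significant SNPs) using the gene and pathway memberships reported for Mexican Americans in the examples above. The dictionary layout, group labels, and function name are assumptions made for illustration.

```python
from collections import defaultdict

# PheWAS-significant SNP -> gene, as reported in the examples above.
snp_gene = {
    "rs328": "LPL",
    "rs174547": "FADS1",
    "rs2231142": "ABCG2",
    "rs7442295": "SLC2A9",
    "rs6855911": "SLC2A9",
    "rs1800588": "LIPC",
    "rs3135506": "APOA5",
}

# Gene -> annotation groups mentioned in the text (labels are illustrative).
gene_groups = {
    "LPL": {"NetPath: TGF-beta receptor",
            "KEGG: glycerolipid metabolism",
            "KEGG: PPAR signaling pathway"},
    "FADS1": {"NetPath: TGF-beta receptor"},
    "ABCG2": {"GO: urate metabolic process"},
    "SLC2A9": {"GO: urate metabolic process"},
    "LIPC": {"KEGG: glycerolipid metabolism"},
    "APOA5": {"KEGG: PPAR signaling pathway"},
}


def linked_groups(snp_to_gene, groups_by_gene):
    """Return annotation groups that connect two or more genes
    containing PheWAS-significant SNPs."""
    members = defaultdict(set)
    for gene in set(snp_to_gene.values()):
        for group in groups_by_gene.get(gene, ()):
            members[group].add(gene)
    return {group: sorted(genes)
            for group, genes in members.items() if len(genes) >= 2}


networks = linked_groups(snp_gene, gene_groups)
```

Each key of `networks` corresponds to one connecting node (a circle in Figures 2.8 through 2.10), and its value lists the genes that the node joins.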

Discussion

For this PheWAS, performed using NHANES data, we replicated a number of previously published results and found novel and pleiotropic associations. For example, for rs2231142, a missense SNP in ATP-binding cassette subfamily G member 2 (ABCG2), we replicated previous associations with uric acid levels observed in European-descent populations and in Mexican Americans with the same direction of effect. Additionally, we identified a novel association for this SNP with protoporphyrin in both the European-descent population and Mexican Americans, where the coded allele (C) was associated with increased uric acid levels as well as increased protoporphyrin. This PheWAS finding is intriguing in light of some of the known connections that link protoporphyrin with uric acid levels, suggesting the potential for this SNP to have an impact on the levels of one or both, resulting in the associations identified here.

Protoporphyrin combines with heme to form iron-containing . This gene is in the bile secretion pathway, and bile consists of substances including bilirubin, which is converted from heme/porphyrin. Thus, the observed association is consistent with a known biological process.

There is also a known correlation between ferritin levels and uric acid levels, and urate forms a coordination complex with iron to diminish electron transport, acting as an iron chelator and antioxidant 145. This correlation implies an expected link between protoporphyrin and uric acid association results; however, we did not observe an association with ferritin levels in this study for this SNP.

The PheWAS-significant association between rs2231142 and blood pressure levels was only observed in Mexican Americans. However, the direction of effect is opposite to that seen for uric acid levels and protoporphyrin. There is a demonstrated positive correlation between high blood pressure and high serum uric acid levels145, but the relationships between rs2231142 and diastolic blood pressure compared with serum uric acid levels in our study were inconsistent, suggesting an independent relationship between this SNP and the two phenotypes. Thus, this is an example of the novel discoveries that can occur with the PheWAS approach that would not be found through only investigating the association between multiple SNPs and a single trait outcome or phenotype.

Novel associations for hematologic traits were found in this PheWAS. The SNPs rs1800795, near the interleukin 6 gene (IL6), and rs4355801, in tumor necrosis factor receptor superfamily member 11b (TNFRSF11B), had significant associations with white blood cell counts in non-Hispanic blacks. There are known associations between hematologic traits and genetic variants on chromosome 1 in African Americans, spanning a wide region of that chromosome146. This region of association is due to the presence of the African-derived Duffy Null polymorphism, a genetic variant protective against Plasmodium vivax malaria. Presence of this variant explains the lower white blood cell and neutrophil counts in African Americans147. However, neither rs1800795 nor rs4355801 is located on chromosome 1, and these SNPs therefore represent potentially unique associations with hematologic traits.

Further novel associations with circulating vitamin levels were found. The SNP rs1260326 was associated with vitamin A in non-Hispanic whites. Vitamin E was associated with rs13266634, rs28927680, and rs1800588 in non-Hispanic whites and with rs964184 in non-Hispanic whites and Mexican Americans. Additionally, folate levels were associated with rs174547 in non-Hispanic blacks and with rs1800588 in Mexican Americans. When considering the direction of effect for the vitamin levels, we found that rs174547, an intronic SNP in fatty acid desaturase 1 (FADS1), was associated with ferritin and iron levels with different directions of effect in Mexican Americans. Conversely, vitamin E showed the same direction of effect as triglycerides. Recent findings indicate a potential relationship between vitamin E intake and triglyceride levels for certain SNPs. These results may reflect an interaction between variability in vitamin E intake and genetic variance.

Other SNPs with pleiotropic effects showed associations with different directions of effect. For example, rs780094, an intronic SNP in glucokinase regulator (GCKR), was associated with serum glucose levels with a positive direction of effect (β = 0.67) and with potassium and vitamin B6 intake levels with a negative direction of effect (β = -0.05 and -0.11, respectively) in Mexican Americans. This result is consistent with the demonstrated inverse relationship between potassium intake and glucose intolerance148. Likewise, glucose tolerance has been found to increase upon vitamin B6 supplement intake in women with gestational diabetes mellitus149,150. One possibility, requiring further investigation, is that this SNP modulates the effect of vitamin B6 and potassium on glucose levels.

Fourteen of our results showed both a significant PheWAS association and the same direction of effect for a different race-ethnicity. We did not investigate non-significant results with a similar direction of effect for this study. We evaluated the differences in allele frequency across the two surveys and across race-ethnicity for the SNPs that met our criteria for PheWAS replication (Appendix Table 2.4). There were no consistent trends between similar or markedly different allele frequencies and whether we did or did not see the same SNP-phenotype associations across more than one race-ethnicity. The reason for differences in association may lie in the variation between linkage disequilibrium patterns across populations. Additionally, as genetic architecture can vary across different race-ethnicities, there is the potential for finding novel associations that exist in only one population. Low power due to sample size could also have contributed to fewer significant associations in non-Hispanic black and Mexican American populations, when compared to non-Hispanic whites, as the sample sizes were generally smaller. Further, phenotypic outcome is impacted by both genetic variation and environmental exposure variation, and thus some associations may not replicate across race-ethnicity in part due to potentially different environmental exposures across racial/ethnic groups.

We found examples of gene-gene connections that link our PheWAS results from the SNP to gene to pathway level. These examples show the utility of applying known information about genes to provide biological context for individual PheWAS results through visually linking the information together. Multiple connections not readily apparent when exploring tabular results can be highlighted with this approach. For example, Figure 2.9 shows three SNPs within two different genes that are within the GO biological process of “urate metabolic process”, a group of gene products involved in the chemical reactions and pathways involving urate. These SNPs are all associated with uric acid levels in our PheWAS. These SNPs have previously reported associations with uric acid levels, and these genes are known to be involved with pathways that contain urate. However, through connecting phenotypes, SNPs, genes, and pathways, and visualizing the results, we can more clearly show how single genetic variants are likely biologically linked to outcome variation. Further, this example shows the SNP rs2231142 associated with two other phenotypes, as described earlier in this discussion.

We also presented network results in Figures 2.8, 2.9, and 2.10. The results presented in Figure 2.8 show two SNPs in different genes that are both found in the TGF-β receptor regulated NetPath pathway. This would not have been evident in the PheWAS without applying annotation from known pathways. Figure 2.10 shows one example of two genes involved in the KEGG biological process “glycerolipid metabolism”. Here, one SNP is associated with HDL-C levels, and, interestingly, a separate SNP in the network is associated with folate levels. Plasma folate levels have been associated with lipoprotein profiles151. Further, the LPL SNP rs328 was associated in this study with HDL-C, and LPL is also involved in the KEGG pathway “Peroxisome Proliferator-Activated Receptor (PPAR) signaling pathway”, along with APOA5, a SNP in which was associated with triglyceride levels. PPARs are transcription factors activated by lipids. In the future we will continue to use this network approach to highlight both the biological context that supports results found in PheWAS and the biological annotation that may identify relationships that forge new hypotheses about the connection between genetic variation and complex outcomes.

One limitation of the current PheWAS approach is the risk of false-positive associations due to the large number of tests for association between SNPs and phenotypes. For this analysis, we required replication of association results across NHANES surveys to reduce the Type I error rate. Correcting for multiple hypothesis testing based on the number of tests, studies, or groups, to account for the comprehensive associations in PheWAS and the potentially inflated Type I error, can be problematic for multiple reasons. Multiple testing calculations assume independent tests, which we do not have here because phenotypes are correlated across our PheWAS studies. Also, our power can vary from one result to another, in part due to variations in sample size for the specific phenotype. In addition, we used phenotype-class binning of results, which yields different numbers of sub-phenotypes in each bin for potential replication. Future work includes research into additional methods for addressing the multiple testing burden in PheWAS, such as permutation testing and binning methods. Another limitation of the PheWAS approach is the high-throughput nature of the analysis. For instance, adjustments were not made for participants on medication that could modify or lower measurements such as lipids. The results are considered preliminary and bear further inquiry. However, it is notable that we observed replication of a number of previously published results with the same direction of effect, indicating that our high-throughput approach is functional for a number of measures. Because we chose to seek replication across NHANES surveys, we did not explore results unique to any one survey.
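Permutation testing, named above as future work, can be illustrated for a single SNP-phenotype pair. The sketch below is a generic label-permutation test and not a method applied in this study: the test statistic (absolute covariance between genotype and phenotype), the function names, and the toy data are all assumptions chosen to keep the example self-contained.

```python
import random


def abs_cov(x, y):
    """Absolute (unnormalized) covariance between two equal-length lists."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    return abs(sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y)))


def permutation_p(genotypes, phenotype, n_perm=2000, seed=0):
    """Empirical p-value: the share of phenotype shuffles whose statistic
    is at least as large as the observed one (with add-one correction)."""
    rng = random.Random(seed)  # seeded so the result is reproducible
    observed = abs_cov(genotypes, phenotype)
    shuffled = list(phenotype)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs_cov(genotypes, shuffled) >= observed:
            hits += 1
    return (hits + 1) / (n_perm + 1)


# Toy data (hypothetical): coded-allele counts and a phenotype that
# increases with each copy of the coded allele.
toy_genotypes = [0] * 10 + [1] * 10 + [2] * 10
toy_phenotype = [1.0 * g + 0.01 * i for i, g in enumerate(toy_genotypes)]
```

To respect the correlation among phenotypes noted above, a PheWAS-wide version would shuffle genotypes once per permutation and recompute every phenotype's statistic against the same shuffle, for example keeping the maximum statistic across phenotypes; that extension is not shown here.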

A major strength of the PheWAS approach is the potential for novel discoveries about genetic variants and their relation to phenotypes for future investigation as well as to replicate results found in GWAS. Phenome-wide associations provide the opportunity to uncover complex networks of phenotypes involved in disease through tests of association between genetic variants and a broad range of phenotypes. Utilizing existing epidemiologic collections such as the diverse NHANES allows for potential generalization of variant-phenotype relationships across race-ethnicities.

We have found novel associations for phenotypes such as white blood cell count and vitamin levels for SNPs with different, previously known associations. We have additionally found indications of pleiotropy. Further, because this approach investigates single SNPs with multiple phenotypes, results with contrasting direction of effect can be investigated. We explored the results of this PheWAS within the context of additional biological information including the use of network diagrams. In addition, we were able to pursue this across multiple race-ethnicities, whereas much of the approach in GWAS has been within European Americans. The results described here demonstrate the utility of the PheWAS approach to expose relevant results that differ from what is already known about the relationships between multiple phenotypes and between genotype and phenotype to uncover the complexity that exists between human traits.

Funding Acknowledgements

The "Epidemiologic Architecture for Genes Linked to Environment (EAGLE)" was funded through the NHGRI PAGE program (U01HG004798 and its NHGRI ARRA supplement). Genotyping services for select NHANES III SNPs presented here were also provided by the Johns Hopkins University under federal contract number (N01-HV-48195) from NHLBI. The study participants derive from the National Health and Nutrition Examination Surveys (NHANES), and these studies are supported by the Centers for Disease Control and Prevention. The funders had no role in study design, data collection and analysis, decision to publish, or preparation of the manuscript.

CHAPTER 3

INVESTIGATING THE EXPOSOME AND GENE-ENVIRONMENT INTERACTIONS

USING ENVIRONMENT-WIDE ASSOCIATION STUDIES (EWAS)

* This chapter is adapted from Hall, M. A. et al. Environment-wide association study (EWAS) for type 2 diabetes in the Marshfield Personalized Medicine Research Project Biobank. Pac. Symp. Biocomput. 200–11 (2014).

Abstract

Environment-wide association studies (EWAS) provide a way to uncover the environmental mechanisms involved in complex traits in a high-throughput manner. Genome-wide association studies have led to the discovery of genetic variants associated with many common diseases but do not take into account the environmental component of complex phenotypes. This EWAS assessed the comprehensive association between environmental variables and the outcome of type 2 diabetes (T2D) in the Marshfield Personalized Medicine Research Project Biobank (Marshfield PMRP). We sought replication in two National Health and Nutrition Examination Surveys (NHANES). The Marshfield PMRP currently uses four tools for measuring environmental exposures and outcome traits: 1) the PhenX Toolkit, which includes standardized exposure and phenotypic measures across several domains; 2) the Diet History Questionnaire (DHQ), a food frequency questionnaire; 3) the Measurement of a Person’s Habitual Physical Activity, which scores the level of an individual’s physical activity; and 4) electronic health records (EHR), to which validated algorithms are applied to establish T2D case-control status. Using PLATO software (Hall et al., in preparation), 314 environmental variables were tested for association with T2D using logistic regression, adjusting for sex, age, and BMI, in over 2,200 European Americans. When available, similar variables were tested with the same methods and adjustment in samples from NHANES III and NHANES 1999-2002. Twelve and 31 associations were identified in the Marshfield samples at p<0.01 and p<0.05, respectively. Seven and 13 measures replicated in at least one of the NHANES at p<0.01 and p<0.05, respectively, with the same direction of effect. The most significant environmental exposures associated with T2D status included decreased alcohol use as well as increased smoking exposure in childhood and adulthood. The results demonstrate the utility of the EWAS method and survey tools for identifying environmental components of complex diseases like type 2 diabetes. These high-throughput and comprehensive investigation methods can easily be applied to investigate the relation between environmental exposures and multiple phenotypes in future analyses.

Introduction

Computational methods to assess environmental exposures are essential to elucidate the complex nature of common human phenotypes. Genome-wide association studies (GWAS) have allowed for greater understanding of the genetic component of complex traits and identification of numerous loci associated with these traits12. They have provided a high-throughput approach for comprehensive testing of variants across the genome. However, this approach fails to consider the richly diverse and complex environment with which humans interact throughout the life course.

While GWAS have uncovered thousands of single nucleotide polymorphisms (SNPs) associated with disease, much remains unclear about the heritability and mechanisms that lead to common, complex human diseases12,20. It is likely that environmental exposure greatly impacts the genetic and cellular systems at play for many complex traits20. Environment-wide association studies (EWAS)53 provide a method to test a variety of exposures across the human environment in a high-throughput, unbiased manner, much like GWAS tests for genetic effects. EWAS can be further employed as a main effect exposure filter for subsequent gene-environment interaction analysis. This approach acts to reduce the multiple testing burden conferred when running pairwise interactions between all available genetic and exposure variables. The utility of the EWAS approach was demonstrated for type 2 diabetes (T2D) using an array of laboratory measurements to identify a diverse number of exposures associated with T2D53. Such comprehensive laboratory measurements are rare and only assess exposures at a fixed time point without consideration of the various exposures throughout an individual’s lifetime. Thus, there is a need to evaluate comprehensive and standardized survey tools that enable assessment of exposures and lifestyle choices over time and comparison of results across multiple studies.
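The reduction in multiple testing burden from using EWAS as a main-effect filter can be made concrete with simple arithmetic. In the sketch below, the exposure counts (314 variables tested, 12 associated at p<0.01 in the Marshfield samples) come from the abstract of this chapter, while the number of SNPs is a hypothetical placeholder and the Bonferroni correction is used only as a familiar yardstick, not as the correction applied in this work.

```python
def gxe_tests(n_snps, n_exposures):
    """Number of pairwise SNP-by-exposure interaction tests."""
    return n_snps * n_exposures


def bonferroni(alpha, n_tests):
    """Per-test significance threshold under a Bonferroni correction."""
    return alpha / n_tests


N_SNPS = 500_000      # hypothetical array size, not taken from the text
ALL_EXPOSURES = 314   # environmental variables tested (from the abstract)
TOP_EXPOSURES = 12    # Marshfield associations at p < 0.01 (from the abstract)

full = gxe_tests(N_SNPS, ALL_EXPOSURES)        # every SNP against every exposure
filtered = gxe_tests(N_SNPS, TOP_EXPOSURES)    # only EWAS-filtered exposures

threshold_full = bonferroni(0.05, full)
threshold_filtered = bonferroni(0.05, filtered)
```

Under these assumptions the filter cuts the number of interaction tests from 157,000,000 to 6,000,000, a reduction by the ratio 314/12 regardless of the SNP count, and the per-test threshold becomes correspondingly less stringent.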

The PhenX (consensus measures for Phenotypes and eXposures) Toolkit (https://www.phenxtoolkit.org/) was developed as a resource for collecting standardized measures of phenotypes and environmental exposures45. Measures are available across 27 domains covering alcohol, tobacco, and other substance use; demographics; mental health; environmental exposures; diet; and disease, among others. In addition to providing information on traits, many of these measures can be used to ascertain information on lifestyle and environmental exposures. Other valuable resources for environmental measures include 1) the Measurement of a Person’s Habitual Physical Activity (Baecke), a questionnaire measuring a person’s work, leisure, and sport activity level152, and 2) the Diet History Questionnaire (DHQ; http://riskfactor.cancer.gov/DHQ/), a food frequency questionnaire153,154.

Electronic health records (EHR) are a growing resource for measuring health outcomes in individuals, as they contain vast amounts of medical data including records of diagnoses, procedures, and clinical laboratory measurements155. These data can be used, with electronic algorithms, to systematically define cases and controls for numerous phenotypes of interest, such as type 2 diabetes. The Electronic Medical Records and Genomics (eMERGE) Network combines EHR data from sites across the United States and currently utilizes electronic phenotyping algorithms for over a dozen phenotypes156. The Marshfield Personalized Medicine Research Project Biobank (Marshfield PMRP)157, part of the eMERGE Network, is one site currently employing EHR phenotyping as well as the PhenX Toolkit, the Measurement of a Person’s Habitual Physical Activity (Baecke), and the Diet History Questionnaire (DHQ). Taken together, the PMRP is a rich phenotypic resource for genomic and environmental association analyses to dissect the architecture of complex traits.

Presented in this chapter are the results of an EWAS for type 2 diabetes using survey questions from the PhenX Toolkit, DHQ, and Baecke surveys from the Marshfield PMRP. To seek replication of these results with similar survey questions when available, we used data from the National Health and Nutrition Examination Surveys (NHANES)158. Further, top results from the EWAS were used as putative exposures for gene-environment interaction models.

To the authors’ knowledge, this is the first EWAS performed using EHR data.

Environment-wide association studies provide a methodology to test environmental measures in a comprehensive, high-throughput manner. Integration of EWAS with phenome-wide association

studies (PheWAS) and genome-wide association studies (GWAS) will further elucidate the complex interplay of gene and environment in common traits as well as the ways in which exposures modulate pleiotropy. Using multiple exposure and outcome variables to assess environment and lifestyle factors using EWAS will provide a richer understanding of the architecture of complex traits.

Methods

Marshfield PMRP and Type 2 Diabetes Case Identification

The Marshfield PMRP is a population-based biobank with ~20,000 subjects, aged 18 years and older, enrolled in the Marshfield Clinic healthcare system in central Wisconsin. DNA, plasma, and serum samples are collected at the time the enrollee completes a written informed consent document, with allowance for ongoing access to the linked electronic health records (EHR).

PMRP participants also complete questionnaires, including responses regarding smoking history, occupation, physical activity, diet, and a variety of other PhenX measures. A subset of the

Marshfield PMRP subjects completed the PhenX survey, the DHQ, and/or the Measurement of a

Person’s Habitual Physical Activity (Table 3.1).

The NHGRI-funded eMERGE (Electronic Medical Records and Genomics) Network has implemented robust electronic phenotyping algorithms to select cases and controls for a number of different phenotypes/outcomes156. Using an algorithm developed by eMERGE159, T2D cases were identified from their records in the Marshfield EHR. The Marshfield samples were originally selected for eMERGE based on their cataract case-control status; their use here is an example of the reusability of biobank samples for additional traits. T2D cases were defined as having the following in their EMR: a T2D ICD-9 medical billing code, information about insulin medication, abnormal glucose or HbA1c levels, or more than two diagnoses of T2D by a clinician.

All T2D cases with an ICD-9 code for T1D were removed from further analyses. All control subjects had to have at least 2 clinical visits, at least one blood glucose measurement, normal

blood glucose or HbA1c levels, no ICD-9 codes for T2D or any related condition, no history of being on insulin or any diabetes related medication, and no family history of T1D or T2D.
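The case and control criteria above can be read as a simple rule set. The following is a minimal Python sketch of that reading, not the eMERGE algorithm itself: the `EhrSummary` fields and the `classify` function are hypothetical names introduced for illustration, and treating the listed case criteria as alternatives (any one suffices) is an assumption, since the published algorithm combines these elements in more detail.

```python
from dataclasses import dataclass


@dataclass
class EhrSummary:
    """Hypothetical per-subject EHR flags; field names are illustrative only."""
    t2d_icd9: bool = False
    t1d_icd9: bool = False
    insulin_medication: bool = False
    other_diabetes_medication: bool = False
    abnormal_glucose_or_hba1c: bool = False
    clinician_t2d_diagnoses: int = 0
    clinical_visits: int = 0
    glucose_measurements: int = 0
    related_condition_icd9: bool = False
    family_history_diabetes: bool = False


def classify(r: EhrSummary) -> str:
    """Return 'case', 'control', or 'excluded' under the simplified reading."""
    # Assumption: any single line of T2D evidence is enough to be a candidate case.
    t2d_evidence = (
        r.t2d_icd9
        or r.insulin_medication
        or r.abnormal_glucose_or_hba1c
        or r.clinician_t2d_diagnoses > 2
    )
    if t2d_evidence:
        # Candidate cases carrying a T1D code are removed from analysis.
        return "excluded" if r.t1d_icd9 else "case"
    # Reaching this point already implies no T2D code, no insulin,
    # and normal glucose/HbA1c; the remaining control criteria follow.
    is_control = (
        r.clinical_visits >= 2
        and r.glucose_measurements >= 1
        and not r.related_condition_icd9
        and not r.other_diabetes_medication
        and not r.family_history_diabetes
    )
    return "control" if is_control else "excluded"


print(classify(EhrSummary(t2d_icd9=True)))
print(classify(EhrSummary(clinical_visits=3, glucose_measurements=2)))
```

Subjects who meet neither definition are left out of the analysis rather than being forced into a group, which mirrors the strict control definition in the text.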

Table 3.1. Marshfield Type 2 Diabetes Case/Control Sample Size for Each Questionnaire

Sample | Questionnaire | Total Sample Size | # T2D Cases | # Controls
Total | PhenX | 2,243 | 433 | 1,810
Total | DHQ | 2,606 | 559 | 2,047
Total | Activity | 2,571 | 552 | 2,018
Male | PhenX | 898 | 204 | 694
Male | DHQ | 1,051 | 260 | 791
Male | Activity | 1,035 | 257 | 778
Female | PhenX | 1,345 | 229 | 1,116
Female | DHQ | 1,555 | 299 | 1,256
Female | Activity | 1,535 | 295 | 1,240

Age: All > 50
Ancestry: All European

Environmental Variable Measurements

PhenX Toolkit

The PhenX Toolkit (www.phenxtoolkit.org) was accessed to develop a self-administered questionnaire to assess environmental and lifestyle factors. Some of the PhenX measures were chosen because of the potential for gene-environment associations with age-related cataract, a primary disease of interest for PMRP (smoking, alcohol, ultraviolet light exposure); some were chosen because of the potential for validation against prior PMRP questionnaire data and medical history information (demographics, physical activity, family history of heart attack, history of stroke); and the rest were chosen because of the potential for future research and cross-site collaborations (hypomania/mania symptoms, hand dominance) within the network funded through administrative supplements to collect PhenX measures. The time to complete the questionnaire ranged from 20 to 40 minutes in pre-testing, depending on how many questions were bypassed through skip logic. The 32-page self-administered questionnaire was mailed to all eligible subjects with a cover letter and return address envelope. A second mailing was employed to increase the response rate. Subjects were offered $10 for their time to complete the questionnaire.

PhenX survey data were entered and merged with prior PMRP questionnaire information from the Marshfield Clinic electronic medical record. For validation purposes, the electronic medical record was considered to be the gold standard where possible. Two hundred fifty-five measures from the PhenX Toolkit were included for our analysis. Questions included a range of topics from the following classes: alcohol use, smoking, demographics, depression, mania, activity, residential environment, and UV exposure.

Diet History Questionnaire

Food frequency questionnaires (FFQs) are widely used to assess dietary intake in epidemiologic studies because they are more representative of usual intake and, being usually self-administered, less expensive to implement than other methodologies such as weighed food records and 24-hour dietary recalls. Inclusion of aids to estimate portion sizes is essential to improve the accuracy and validity of FFQs153. Self-administered FFQs are available for approximately two-thirds of the PMRP cohort to quantify usual dietary intake of all major nutrients. The selected FFQ, the Diet History Questionnaire (DHQ)

(http://riskfactor.cancer.gov/DHQ/), was developed by researchers at the National Cancer

Institute (NCI) and has been shown to be superior to the commonly used Willett FFQ and similar to the Block FFQ for estimating absolute nutrient intakes153. All three FFQs produce similar results after statistical adjustment for total energy intake. The list of foods and portion sizes on the DHQ was developed from nationally representative data, the USDA’s 1994-1996 Continuing

Survey of Food Intakes by Individuals, and is therefore most appropriate for use with this study

population. The DHQ comprises 124 separate food items and asks about portion sizes for most foods. In addition, there are 10 questions about nutrient supplement intake. The DHQ was printed and scanned by National Computer Systems as has been done for all recent studies conducted at the NCI using the DHQ. The completed DHQ was mailed to National Computer

Systems for scanning. After scanning, the data from the questionnaires are stored in ASCII format and then uploaded into the nutrient analysis software package. Diet*Calc software, available from the National Institutes of Health, is used for the nutrient analyses of the DHQ data

(http://riskfactor.cancer.gov/DHQ/dietcalc/). This is the software package that was used for analysis of the DHQ for the Eating at America’s Table Study. The DHQ is mailed to participants with their appointment reminders so that they can complete it prior to their appointment to save them time. The Research Project Assistants review all DHQs to ensure that they have been completed. Fifty-six measures of dietary intake were assessed for this EWAS that covered the following domains: vitamin, fat, protein, carbohydrate, fiber, cholesterol, caloric, grain, vegetable, caffeine, and alcohol intake.

Measurement of a Person’s Habitual Physical Activity

As with measurement of dietary intake for epidemiologic studies, a number of different validated tools have been used in the past to measure physical activity. The agreement between physical activity questionnaires and gold-standard measures tends to be somewhat lower than for dietary intake, but is reasonable for ranking relative activity levels in groups. The researchers preferred to use a previously developed physical activity assessment tool to allow comparison with results from other study populations. Requirements of the selected tool included: 1) self-administered, 2) previously validated, and 3) validated for use in a similar study population across a range of ages.

The selected physical activity questionnaire, the ARIC/Baecke questionnaire, is self-administered, validated for use in both men and women, and currently being used in a large, prospective study in the US152. The questionnaire has been shown to have high reliability and accurate assessment

of both high intensity activity and light intensity activity such as walking. It comprises 16 questions and generates three indices of activity: 1) a work index, 2) a sport index, and 3) a leisure-time index. This one-page self-administered physical activity questionnaire is mailed along with appointment reminders and the Diet History Questionnaire (DHQ). Information from the completed physical activity questionnaires is entered twice into a Microsoft Access database.

The two entries are compared to ensure accuracy of the data entry. The three physical activity indices (work, sport, and leisure-time) are calculated and the data merged with anthropometric, dietary, and demographic data for subsequent analyses.

National Health and Nutrition Examination Surveys (NHANES)

NHANES III Phase 2, conducted between 1991 and 1994, and NHANES 1999-2002 measure the health and nutritional habits of participants by collecting medical, dietary, demographic, laboratory, lifestyle, and environmental exposure data using questionnaire and laboratory measures. The NHANES data were collected by the National Center for Health Statistics

(NCHS) at the Centers for Disease Control and Prevention (CDC). All participants were consented by the CDC at the time of the survey and sample collection.

To seek replication of the Marshfield results, we identified measures similar to the most significant Marshfield PMRP EWAS results in NHANES III and NHANES 1999-2002. Because different survey methods were utilized between Marshfield PMRP and the NHANES, measures were chosen when they matched a significant broad environmental “class”. For example, many smoking measures were included in the most significant EWAS results and any smoking measure found in either NHANES was included for replication. T2D case/control status was defined using an algorithm previously described160.

Genotype Data

DNA samples from Marshfield Clinic were genotyped using the Illumina 660W-Quad array. Data

were cleaned using the eMERGE QC pipeline developed by the eMERGE Genomics Working

Group161. This process includes evaluation of sample and marker call rate, sex mismatch, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium, sample relatedness, and population stratification. For the discovery dataset, QC thresholds included: marker call rate > 99%, sample call rate > 99%, and minor allele frequency (MAF) > 5%. After applying this QC filter, 498,654 markers remained for interaction analysis.
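The three numeric thresholds above can be illustrated with a short Python sketch. This is not the eMERGE QC pipeline; the function names, the genotype encoding (0/1/2 allele counts with `None` for a missing call), and the order of filtering (samples first, then markers) are assumptions made for the example.

```python
def call_rate(calls):
    """Fraction of non-missing genotype calls."""
    return sum(g is not None for g in calls) / len(calls)


def minor_allele_frequency(calls):
    """MAF from 0/1/2 allele counts, ignoring missing calls."""
    called = [g for g in calls if g is not None]
    if not called:
        return 0.0
    freq = sum(called) / (2 * len(called))
    return min(freq, 1.0 - freq)


def qc_filter(genotypes, min_marker_call=0.99, min_sample_call=0.99, min_maf=0.05):
    """genotypes[i][j] is the call for sample i at marker j.

    Returns (kept_sample_indices, kept_marker_indices).
    """
    n_markers = len(genotypes[0])
    keep_samples = [i for i, row in enumerate(genotypes)
                    if call_rate(row) > min_sample_call]
    keep_markers = []
    for j in range(n_markers):
        column = [genotypes[i][j] for i in keep_samples]
        if (column and call_rate(column) > min_marker_call
                and minor_allele_frequency(column) > min_maf):
            keep_markers.append(j)
    return keep_samples, keep_markers


# Toy matrix: 4 samples x 3 markers.
toy = [
    [0, 1, 0],
    [1, 2, 0],
    [2, 1, 0],
    [0, 0, None],   # fails the sample call rate
]
print(qc_filter(toy))
```

In the toy matrix the last sample is dropped for a missing call and the third marker is dropped because it is monomorphic among the remaining samples. The other checks named in the text (sex mismatch, concordance, batch effects, Hardy-Weinberg equilibrium, relatedness, stratification) are outside the scope of this sketch.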

Statistical Analysis

A total of 314 environmental variables were included in our analysis of the Marshfield data.

Logistic regression was used, adjusting for age, sex, and body mass index (BMI), with PLATO

(Hall et al., in preparation). Control was coded as 1 and case as 2. All significant results were investigated to ensure that all top ranking associations had greater than 10 responses for both cases and controls. For the gene-environment interaction analyses, the top EWAS result in the

Marshfield dataset (Alcohol 30 Day Frequency) was tested for interactions with a genome-wide set of quality control filtered loci (498,654 SNPs) in samples for whom both genotype and PhenX data were available (2,044 samples). To determine the significance of the interaction term, we performed a likelihood ratio test (LRT) between the full (Y = β0 + β1·SNP + β2·Exposure + β3·SNP×Exposure) and reduced (Y = β0 + β1·SNP + β2·Exposure) models.
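The likelihood ratio test between the full and reduced models can be sketched in plain Python as follows. This is an illustration only and is not PLATO: the helper names (`solve`, `fit_logistic`, `interaction_lrt`) are hypothetical, the outcome is coded 0/1 rather than 1/2, the covariates (age, sex, BMI) are omitted, and the toy counts are invented. Because the two models differ by one parameter, the statistic 2(logL_full - logL_reduced) is compared to a chi-square distribution with 1 degree of freedom, whose upper tail equals erfc(sqrt(x/2)).

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        piv = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[piv] = m[piv], m[col]
        for r in range(n):
            if r != col:
                f = m[r][col] / m[col][col]
                for c in range(col, n + 1):
                    m[r][c] -= f * m[col][c]
    return [m[i][n] / m[i][i] for i in range(n)]


def fit_logistic(X, y, n_iter=25):
    """Newton-Raphson logistic fit; returns (coefficients, log-likelihood)."""
    p = len(X[0])
    beta = [0.0] * p
    for _ in range(n_iter):
        grad = [0.0] * p
        hess = [[0.0] * p for _ in range(p)]
        for xi, yi in zip(X, y):
            eta = sum(b * v for b, v in zip(beta, xi))
            mu = 1.0 / (1.0 + math.exp(-eta))
            for j in range(p):
                grad[j] += (yi - mu) * xi[j]
                for k in range(p):
                    hess[j][k] += mu * (1.0 - mu) * xi[j] * xi[k]
        beta = [b + s for b, s in zip(beta, solve(hess, grad))]
    loglik = 0.0
    for xi, yi in zip(X, y):
        eta = sum(b * v for b, v in zip(beta, xi))
        loglik += yi * eta - math.log(1.0 + math.exp(eta))
    return beta, loglik


def interaction_lrt(snp, exposure, y):
    """LRT for the SNP x exposure term (1 degree of freedom)."""
    reduced = [[1.0, s, e] for s, e in zip(snp, exposure)]
    full = [[1.0, s, e, s * e] for s, e in zip(snp, exposure)]
    _, ll_reduced = fit_logistic(reduced, y)
    _, ll_full = fit_logistic(full, y)
    stat = 2.0 * (ll_full - ll_reduced)
    p_value = math.erfc(math.sqrt(max(stat, 0.0) / 2.0))
    return stat, p_value


# Invented toy counts: (genotype, exposure) -> (cases, controls).
cells = {(0, 0): (3, 7), (1, 0): (4, 6), (2, 0): (5, 5),
         (0, 1): (3, 7), (1, 1): (6, 4), (2, 1): (8, 2)}
snp, exposure, y = [], [], []
for (s, e), (n_case, n_ctrl) in cells.items():
    for outcome, n in ((1, n_case), (0, n_ctrl)):
        snp += [float(s)] * n
        exposure += [float(e)] * n
        y += [outcome] * n

stat, p_value = interaction_lrt(snp, exposure, y)
print(f"LRT statistic = {stat:.3f}, p = {p_value:.3f}")
```

In a genome-wide scan this test would be repeated once per SNP with the exposure held fixed, which is why an optimized tool is used in practice rather than a pure-Python fit.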

For the NHANES data, logistic regression was used for all association testing, adjusting for age, sex, and BMI, in 46 to 3,964 samples (sample sizes varied for each measure) of European ancestry (self-identified non-Hispanic whites) for a total of 116 environmental variables from

NHANES III (84) and NHANES 1999-2002 (32). All significant EWAS results were assessed to ensure sample size was greater than 10 for cases and controls for each variable. Genome-wide data were unavailable for replication of gene-environment interaction results.
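The screening step applied to EWAS results in both datasets (a p-value threshold plus a check that more than 10 cases and more than 10 controls responded) can be sketched as below. The record layout, the function name `screen_ewas`, and all values in `toy_results` are hypothetical and are not taken from the study data.

```python
def screen_ewas(results, alpha=0.05, min_count=10):
    """Keep associations with p < alpha and more than min_count responses
    among both cases and controls, sorted from most to least significant."""
    kept = [
        r for r in results
        if r["p"] < alpha
        and r["n_cases"] > min_count
        and r["n_controls"] > min_count
    ]
    return sorted(kept, key=lambda r: r["p"])


# Invented example records (not study results).
toy_results = [
    {"variable": "exposure_a", "p": 0.003, "beta": -0.4, "n_cases": 120, "n_controls": 480},
    {"variable": "exposure_b", "p": 0.020, "beta": 0.9, "n_cases": 8, "n_controls": 300},
    {"variable": "exposure_c", "p": 0.300, "beta": 0.1, "n_cases": 150, "n_controls": 500},
    {"variable": "exposure_d", "p": 0.0006, "beta": -0.03, "n_cases": 400, "n_controls": 1600},
]

for r in screen_ewas(toy_results):
    print(r["variable"], r["p"])
```

Here `exposure_b` is dropped despite its p-value because only 8 cases responded, and `exposure_c` is dropped on significance.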

Results

In this environment-wide association study of 314 variables for type 2 diabetes, we found 12 results with a p-value less than 0.01 in the Marshfield Clinic samples. Due to the exploratory and hypothesis generating nature of this method, we are presenting all the results with a p-value less than 0.05 (31 results). Figure 3.1 displays the most significant EWAS associations in the

Marshfield sample.

All variables could be placed into eight broad environmental “classes”: smoking, alcohol use, mania, depression, activity, diet, UV exposure, and residence. Table 3.2 includes all results with a p-value less than 0.05 by environment class and displays the survey question for each measure from the PhenX Toolkit.

Figure 3.1. The most significant association results in the Marshfield sample using PhenX Toolkit, DHQ, and Measurement of a Person’s Habitual Physical Activity surveys. The PhenX variables are listed along the y-axis. The first track shows the results of our EWAS, with –log10 of the p-value plotted from most significant result at the top and descending in order. The next track shows the magnitude and direction of the effect. Case/control status was coded as 1=Control, 2=Case.


Table 3.2. EWAS Variable Classes, Specific PhenX Toolkit Questions, and the EWAS Marshfield PMRP results

Class | Survey: Variable | PhenX Toolkit Question | P-value | Beta
Alcohol | PhenX: Alcohol 30Day Frequency | Think specifically about the past 30 days, from [DATEFILL], up to and including today. During the past 30 days, on how many days did you drink one or more drinks of an alcoholic beverage? | 6E-04 | -0.03
Alcohol | PhenX: Alcohol Withdrawal Hallucination | When you stopped, cut down or went without drinking, did you ever experience any of the following problems for most of the day for 2 days or longer? Did you see or hear things that weren't there? (Yes=1, No=2) | 0.022 | -3.041
Alcohol | PhenX: Lifetime Use | In your entire life, have you had at least 1 drink of any kind of alcohol, not counting small tastes or sips? (Yes=1, No=2) | 0.035 | 0.4655
Alcohol | PhenX: Alcohol Use Liver Disease | There are several health problems that can result from long stretches of drinking. Did drinking ever cause you to have liver disease or yellow jaundice? (Yes=1, No=2) | 0.037 | -1.894
Smoking Exposure | PhenX: Smoking At Home | Does anyone who lives here smoke cigarettes, cigars, or pipes anywhere inside this home? (Yes=1, No=2) | 6E-04 | -0.889
Smoking Exposure | PhenX: Exposure Smoke Childhood | How many hours were you exposed to smoke from other people's cigarettes or tobacco products during childhood per day? | 0.003 | 0.0064
Smoking Exposure | PhenX: Former Smoker Quantity 1DayB | Former smokers who did not ever smoke every day for at least 6 months: when you last smoked every day, on average how many cigarettes did you smoke each day? | 0.006 | 0.2683
Smoking Exposure | PhenX: Exposure Smoke Work | Were you exposed to smoke from other people's cigarettes or tobacco products during adulthood at work? (Yes=1, No=2) | 0.007 | -0.375
Smoking Exposure | PhenX: Former Smoker More 1stHour | Did you smoke more frequently during the first hours after waking than during the rest of the day? (Yes=1, No=2) | 0.019 | -0.689
Smoking Exposure | PhenX: Former Smoker 1stSmoke Time | How soon after you wake up do/did you smoke your first cigarette? | 0.02 | -0.202
Smoking Exposure | PhenX: Exposure Smoke Home | Were you exposed to smoke from other people's cigarettes or tobacco products during adulthood at home? (Yes=1, No=2) | 0.021 | -0.309
Smoking Exposure | PhenX: Former Smoker Quantity 1DayA | Former smokers who smoked cigarettes every day for at least 6 months: when you last smoked every day, on average how many cigarettes did you smoke each day? | 0.039 | 0.0171
Smoking Exposure | PhenX: Exposure Smoke Present Time Hours | At the present time, how many hours per day are you exposed to the smoke of others? | 0.041 | 0.0943
Smoking Exposure | PhenX: Exposure Smoke Adulthood Home Hours | How many hours per day were you exposed to smoke from other people's cigarettes or tobacco products during adulthood at home? | 0.048 | 0.0046
Diet | DHQ: Caffeine (mg) | NA | 0.001 | -0.00050
Activity | Activity: Leisure Index | NA | 0.002 | -0.27
Activity | Activity: Sports Index | NA | 0.003 | -0.31
Activity | PhenX: Leisure Activity | Please check the box next to the one statement which best describes the way you spent your leisure-time during most of the last year. | 0.014 | -0.132
Residence | PhenX: House Gas Powered Device | Are any gas powered devices stored in any room, basement, or attached garage in this (house/apartment)? (Yes=1, No=2) | 0.003 | 0.4187
Residence | PhenX: House Farm | Is this property actively used as a farm or ranch? (Yes=1, No=2) | 0.01 | 0.5382
Residence | PhenX: Dwelling Type | What is the type of dwelling? (1=Detached house, 2=Duplex/Triplex, 3=Row house, 4=Low rise apartment (1-3 floors), 5=High rise apartment (>3 floors), 6=Mobile home / Trailer, 7=Other) | 0.01 | 0.0684
Residence | PhenX: Building Residence Age | When did you start living there? | 0.024 | 0.0072
Residence | PhenX: Air Conditioning Stop Month | During which month (do you usually/would you) stop using air conditioning? | 0.028 | 0.1891
Depression | PhenX: Energy Level | Please indicate the one response that best describes your energy level for the past seven days. (0 = There is no change in my usual level of energy. 1 = I get tired more easily than usual. 2 = I have to make a big effort to start or finish my usual daily activities (for example, shopping, homework, cooking or going to work). 3 = I really cannot carry out most of my usual daily activities because I just don't have the energy.) | 0.005 | 0.2365
Depression | PhenX: Depression Number Weeks | About how many weeks altogether did you feel this way? Count the weeks before, during and after the worst two weeks. The total period of depression/loss of interest was: | 0.044 | -0.022
Mania | PhenX: Mania Increased Sex | Please try to remember a period when you were in a "high" state. In such a state: I am more interested in sex, and/or have increased sexual desire (Yes=1, No=2) | 0.023 | 0.3615
Mania | PhenX: Mania Impatient | Please try to remember a period when you were in a "high" state. In such a state: I am more impatient and/or get irritable more easily (Yes=1, No=2) | 0.044 | -0.321
UV Exposure | PhenX: Weekend Sun Hours Last Decade | On a typical weekend day in the summer, about how many hours did you generally spend in the mid-day sun in the past ten years? | 0.027 | -0.158
UV Exposure | PhenX: Weekday Sun Hours Last Decade | On a typical weekday in the summer, about how many hours did you generally spend in the mid-day sun in the past ten years? | 0.031 | -0.151
UV Exposure | PhenX: Tanning Booth | Have you ever used a tanning booth? (Yes=1, No=2) | 0.042 | 0.4621
UV Exposure | PhenX: Sunlamp Times | About how many times have you used a sunlamp in your life? | 0.048 | 0.8917

When available, similar questions from NHANES that fell into one of the above exposure classes were included to seek replication. Measures were available in alcohol use, smoking exposure, diet, activity, depression, and mania but not in residence and UV exposure. Seven of the results were significant at p < 0.01 and thirteen at p < 0.05 with the same direction of effect as the related Marshfield associations (Figure 3.2).

Figure 3.2. Replicating results of the most significant Marshfield EWAS associations from NHANES III and NHANES 1999-2002. Results were considered a replication if the p-value was < 0.05 and the association showed the same direction of effect as in the Marshfield analyses. Controls were coded as 1 and Cases as 2. This figure is in the same format as Figure 3.1, with NHANES measurements on the y-axis ordered by descending association significance. The tracks show the p-value significance of the association in –log10(p-value) and the magnitude and direction of the effect.
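The replication criterion stated in the caption (p-value below 0.05 and the same direction of effect) reduces to a two-part check, sketched here in Python. The function name and the example values are hypothetical, and the sketch assumes the discovery and replication variables are coded in the same direction (for example, both Yes=1, No=2), which has to be verified for each pair of survey questions.

```python
def replicates(discovery_beta, replication_beta, replication_p, alpha=0.05):
    """True when the replication result is significant at alpha and its
    effect has the same sign as the discovery effect."""
    same_direction = discovery_beta * replication_beta > 0
    return replication_p < alpha and same_direction


# Invented values for illustration only.
print(replicates(-0.03, -0.10, 0.004))  # significant, same sign
print(replicates(-0.03, 0.10, 0.004))   # significant, opposite sign
print(replicates(-0.03, -0.10, 0.20))   # same sign, not significant
```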

The most significant survey questionnaire result in the Marshfield EWAS was alcohol frequency in the last 30 days, which was inversely associated with type 2 diabetes status. This relationship was also observed for two related measures in NHANES III (the alcohol consumption questions “beer and lite beer - times/month” and “wine, etc - times/month”) and one in NHANES 1999-2002 (the alcohol consumption question “How often drink wine (per month)”). Never having consumed alcohol was associated with T2D status in Marshfield but did not replicate in either NHANES, though only a similar, not identical, measure was available for testing. Excessive alcohol use symptoms, namely hallucination due to alcohol withdrawal and liver disease from excess alcohol use, were associated with having T2D in the Marshfield sample. Neither of these measures was available in either NHANES for comparison.

A number of significant results in Marshfield included measurements of firsthand and secondhand smoke exposure. Cigarette or other tobacco smoke exposure at home or at work, and exposure for a greater number of hours during childhood, adulthood, and the present time, were all associated with T2D status. Additionally, for former smokers, a greater number of cigarettes per day, smoking more frequently during the first hours of the day, and smoking sooner after waking were also associated with having T2D. Two smoking measures replicated with the same direction of effect: “number of cigarettes smoked/day when smoked” in NHANES III and “how soon after waking do you smoke?” in NHANES 1999-2002.

The two most significant results from the DHQ for the EWAS in Marshfield were a metric of caffeine consumption: caffeine (mg), which was inversely associated with T2D status and a metric of the consumption of carbohydrates (g). The caffeine measurement did not replicate in either NHANES, though increased coffee intake has been previously reported as having an association with lowered risk of T2D162. Carbohydrate intake did not meet the significance threshold of p-value less than 0.05 in Marshfield, but was included in the replication analysis because it was the second most significant DHQ result. When this association was investigated in

NHANES III and NHANES 1999-2002 it was the most significant result for both studies.

The GxE interaction analysis yielded 128 SNPs that were found to interact with Alcohol 30-Day Frequency at a likelihood ratio test (LRT) p-value less than 1×10⁻⁴ (Figure 3.3). The result with the lowest LRT p-value was for a missense SNP, rs1659219, in glucosidase alpha neutral C (GANC) (LRT p-value: 7.91×10⁻⁷).

Figure 3.3. Manhattan plot of SNPs interacting with Alcohol 30 Day Frequency at an LRT p-value < 1×10⁻⁴. Associations with LRT p-values greater than this threshold are not included in the figure. SNPs are ordered by chromosome and position.

Discussion

Using a systematic, high-throughput EWAS method, we identified and replicated novel as well as established associations between environmental exposures and T2D. The replicated association between less alcohol use per month and T2D status is consistent with prior research demonstrating that moderate alcohol use is associated with decreased risk of T2D163,164. The association between T2D status and the specific symptoms of hallucination and liver disease has not been reported in the literature, to the best of the authors’ knowledge. However, prior research has indicated that binge drinking and high levels of alcohol use are associated with increased risk of T2D163,164. It is possible that these results are spurious, or that there may be some mechanism at play by which these extreme alcohol-related measures are related to T2D. Comparison with other studies for this measure is necessary before conclusions can be drawn.

Gene-environment interaction analysis revealed an interaction between alcohol frequency in the last 30 days and a missense SNP in glucosidase alpha neutral C (GANC). GANC is a glycosyl hydrolase enzyme that hydrolyzes the glycosidic bond between two or more carbohydrates. It is an important enzyme in glycogen metabolism that is associated with diabetes susceptibility, and alpha glucosidase inhibitors have been used clinically to lower glucose levels in diabetics165. This

SNP was not available for replication efforts in NHANES, and thus, these results are preliminary.

Nevertheless, this intriguing finding potentially indicates some mechanism by which either the effect of this locus is modulated by alcohol use or vice versa. Further efforts to replicate these findings may yield more information as to the relationship that may exist between this SNP, alcohol use, and T2D.

The relationship between increased smoking exposure and having T2D is well established166–168. Activity level also has a well-documented link with T2D169–171. Here we observed a number of results from both Marshfield and NHANES III that demonstrate this association. Work activity was not significantly associated with T2D in the PhenX or Baecke measures. However, lower amounts of leisure and sports activity were associated with T2D status in Marshfield. This relationship was validated with similar measures in NHANES III: dancing, gardening/yard work, sports, and running or jogging in the past month.

A number of associations from the residence, depression, mania, and UV exposure classes in Marshfield did not replicate in either NHANES. This could indicate that these were false positive findings, or it could also be due to differences in measures that were used, deviation in survey question wording, or low sample sizes for a given question. Additionally, many of these results could not be evaluated for replication in either NHANES because they were not available.

This demonstrates the need for standardized measures of environmental exposures, as the utilization of these measures will allow the validation of significant results across multiple studies.

Another limitation to this EWAS design is the difficulty in determining whether

associations occurred simply due to T2D diagnosis. For instance, the activity questions measured activity for the past month and did not include information on activity level during childhood or if activity level changed when T2D symptoms were experienced. It is possible that the individuals with T2D participated in less leisure and sport activity due to symptoms but had greater activity levels earlier in life. Similarly, the inverse association observed between T2D and carbohydrate intake may be reflective of individuals who are restricting carbohydrate intake due to T2D diagnosis, a common dietary treatment for the disease172. This issue indicates the importance of gathering environmental variables that measure multiple points of an individual’s lifetime.

Additionally, this approach does not currently consider the full spectrum of environmental exposures. Limitations in the types of exposures assessed, and in when they are collected, restrict a thorough understanding of all the environmental components involved in the development of complex diseases such as T2D.

Environment-wide association studies allow the testing of multiple environmental exposures for association with common disease. Here, we demonstrate the utility of this approach for research using health record data, a novel use for this type of resource. Using this systematic

EWAS approach, exposures will be identified as potential causative agents for complex traits.

Gene-environment interaction models were identified using EWAS as a filtering tool. Similar to the PheWAS method, the EWAS approach can be used to test for association between a diverse array of exposures and numerous phenotypes to discover the types of exposure that are associated with multiple traits. The search for interactions between environmental variables and genetic loci, as well as the independent exposures involved in multiple traits, will further elucidate the genetic and environmental architecture of complex human phenotypes.

Funding Acknowledgements

This work was supported in part by NIH U01 HG004798 and its ARRA supplements and by NIH grants HG006389 and HG006385 in addition to an administrative supplement from PhenX

RISING (NOT-HG-11-009). We would like to thank Dr. Geraldine McQuillan and Jody McLean for their help in accessing the Genetic NHANES data. The findings and conclusions in this report are those of the authors and do not necessarily represent the official position of the Centers for

Disease Control and Prevention.

CHAPTER 4

KNOWLEDGE-DRIVEN METHOD FOR ASSESSING GENETIC INTERACTIONS

* This chapter is adapted from Hall, M. A. et al. Biology-Driven Gene-Gene Interaction Analysis of Age-Related Cataract in the eMERGE Network. Genet. Epidemiol. 39, 376–84 (2015)

Abstract

Bioinformatics approaches to examine gene-gene models provide the means to discover interactions between multiple genes that underlie complex disease. Extensive computational demands and adjusting for multiple testing make uncovering genetic interactions a challenge.

Here, we address these issues using our knowledge-driven filtering method, Biofilter, to identify putative SNP-SNP models for cataract susceptibility, thereby reducing the number of models for analysis. Models were evaluated in 3,377 European Americans (1,185 controls, 2,192 cases) from the Marshfield Clinic, a study site of the Electronic Medical Records and Genomics (eMERGE)

Network, using logistic regression. All statistically significant models from the Marshfield Clinic were then evaluated in an independent dataset of 4,311 individuals (742 controls, 3,569 cases), using independent samples from additional study sites in the eMERGE Network: Mayo Clinic,

Group Health/University of Washington, Vanderbilt University Medical Center, and Geisinger

Health System. 83 SNP-SNP models replicated in the independent dataset at LRT p < 0.05.

Among the most significant replicating models was rs12597188 (intron of CDH1) – rs11564445

(intron of CTNNB1). These genes are known to be involved in processes that include: cell-to-cell adhesion signaling, cell-cell junction organization, and cell-cell communication. Further Biofilter analysis of all replicating models revealed a number of common functions among the genes harboring the 83 replicating SNP-SNP models, which included signal transduction and PI3K-Akt signaling pathway. These findings demonstrate the utility of Biofilter as a biology-driven method, applicable for any GWAS dataset.

Introduction

Robust computational methods to explore gene-gene interactions are essential for elucidating the complex nature of common human traits. The genome-wide association study (GWAS)12 has been the traditional paradigm for identifying main effects of genetic variants across the genome for one or more phenotypes, but it has yielded an incomplete explanation of the variation in common, complex traits13,20,21. Genetic interaction analysis offers an additional tool for exploring genetic association, and testing models that allow for interactions between genetic variants reflects the complex nature of biology19–21. Gene products do not function in isolation; rather, they physically interact with other proteins, perform regulatory roles, and operate dynamically in one or more pathways.

Bioinformatics methods have expanded in recent years to include searches for genetic interactions, yet many challenges remain for these analyses including extensive computational and time requirements as well as a high penalty for correction of multiple comparisons when exhaustively testing pairwise combinations of all genome-wide SNPs. One method for overcoming this challenge is to filter the number of loci investigated, thus reducing the number of tests. Two main strategies for filtering include: (1) limiting the interaction analyses to only include one or more variants that have demonstrated an association with the trait through GWAS or candidate gene studies; and (2) filtering SNPs based on biologically-established gene-gene interactions93.

Knowledge-based filtering decreases the investigation search space to biologically related gene pairs. Previous studies have shown success in applying prior knowledge to genetic interaction analyses92,95,96. Biofilter software91,106,107 was developed to decrease the search space required for investigating genetic interactions using knowledge across numerous biological databases and has already been adopted for use in studies of complex diseases and traits such as multiple sclerosis97,98, HIV pharmacogenetics99, HDL cholesterol102, and other lipid traits

(Holzinger et al., submitted; De et al., submitted). Biofilter utilizes biologically validated knowledge of the relationships between sets of genes to build pairwise SNP-SNP models from functionally linked gene-gene pairs. This process takes advantage of biological knowledge of gene-gene relationships rather than requiring loci to have demonstrated a main effect, allowing for the detection of those variants that are only found to be associated with a given phenotype when acting in combination with another locus. In this study, we provided Biofilter with a set of genome-wide SNPs (~ 500,000). By accessing biological knowledge available from open-access pathway, ontology, protein interaction, and gene function online databases, Biofilter identified

400 knowledge-driven gene-gene models with approximately 260,000 SNP-SNP models corresponding to the 400 gene pairs. This process reduced the search space from the over 100 billion SNP-SNP models required for an exhaustive pairwise analysis. Filtering based on knowledge decreases the investigation to only gene pairs that have established biological relationships with one another.
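The scale of this reduction can be checked with a short calculation. The Python sketch below uses the approximate figures quoted above (about 500,000 SNPs and about 260,000 filtered models); the variable names are illustrative only and are not part of Biofilter.

```python
from math import comb

n_snps = 500_000            # approximate number of genome-wide SNPs given to Biofilter
filtered_models = 260_000   # approximate number of Biofilter SNP-SNP models

# An exhaustive pairwise analysis tests every unordered SNP pair: n choose 2.
exhaustive_models = comb(n_snps, 2)

# Fold reduction in the number of tests, and hence in the multiple-testing burden.
fold_reduction = exhaustive_models / filtered_models

print(f"{exhaustive_models:,} exhaustive pairwise models")
print(f"roughly {fold_reduction:,.0f}-fold fewer models after filtering")
```

With these rounded inputs the exhaustive count is on the order of 1.25 × 10^11, consistent with the "over 100 billion" figure in the text.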

We applied this method to age-related cataract, which is the leading cause of blindness worldwide173 and is responsible for approximately 60% of Medicare costs related to vision174.

Summary prevalence estimates indicate that 17.2% of Americans that are 40 years and older have cataract in either eye and 5.1% have had pseudophakia/aphakia (previous cataract surgery).

Several loci have previously been found to be associated with age-related cataract, and it has been suggested that as many as 40 genes may be involved175. Our recent GWAS of age-related cataract revealed novel loci associated with this trait using electronic medical record (EMR) data176.

Despite the identification of numerous associated loci, the molecular mechanisms that lead to age-related cataract remain unclear177. However, many genes that have been implicated function together in pathways and interact with one another178–181. Investigation of SNP-SNP interactions for age-related cataract is relevant, given the molecular complexity involved in lens development and maintenance. A recent exploratory gene-gene interaction analysis of age-related cataract implicated genetic interactions in the genetic etiology of the complex trait101. However, no replication in a separate dataset was performed in those analyses for validation of results. Here,

we present findings of the first genetic interaction study for age-related cataract with replication across two separate studies.

Using PLATO (Hall et al., in preparation) software, we tested the Biofilter-generated

SNP-SNP models for association with age-related cataracts in discovery and replication datasets as part of the NHGRI funded electronic MEdical Records & GEnomics (eMERGE)

Network156,182,183. We identified 83 SNP-SNP models that replicated across the discovery and replication datasets. The results discussed in this chapter demonstrate the utility of Biofilter as a robust method for elucidating the genetic interactions underlying complex traits such as age-related cataract.

Methods

Phenotypic Data

The eMERGE Network implemented an electronic phenotype algorithm to select cataract cases and controls. Age-related cataract as a phenotype was selected by Marshfield Personalized

Medicine Research Project (PMRP) as its primary phenotype. The algorithm, which uses diagnostic (ICD-9) and procedure codes (CPT) as well as natural language processing (NLP), was developed by the Marshfield PMRP investigators184. The five participating study sites from eMERGE included in this study are Marshfield PMRP185, Group Health/University of

Washington, Vanderbilt University186, Mayo Clinic from eMERGE I, and Geisinger Health

System from eMERGE II. In eMERGE I, each of the participating studies applied electronic phenotyping algorithms to identify cases and controls of a specific disease or individuals with a specific phenotype based on their respective electronic medical records (EMRs). DNA samples from individuals selected for study were then genotyped for the original phenotype of interest, and these same individuals were available for additional electronic phenotyping with new algorithms including age-related cataract. No additional GWAS-level genotyping was performed in eMERGE II; thus, additional eMERGE study sites joined the network with existing GWAS

data available on study participants linked to EMRs, which enabled additional electronic phenotyping182.

Cataract cases and controls had to meet the following inclusion criteria: cases were aged 50 years or older at the time of diagnosis or surgery, and controls were aged 50 years or older at the time of their most recent eye exam and had an eye exam in the previous five years. Controls had no diagnostic codes for cataract or evidence of cataract surgery. Cases were identified as “surgical” or “diagnosis-only”. Surgical cases had undergone a cataract extraction in at least one eye. Diagnosis-only cases were required to have either cataract diagnoses on 2 or more dates, or 1 diagnosis date and one or more mentions of cataracts identified by natural language processing (NLP) of electronic chart notes or paper as described by Peissig et al. (2012)184. Validation of case/control status was conducted at each site through manual abstraction of random samples of patient charts.
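The inclusion criteria above can be summarized as a small decision rule. The following Python sketch is only an illustration of the criteria as stated in this section, not the eMERGE phenotype algorithm itself; the `Patient` record, its field names, and the handling of patients with NLP mentions but no diagnostic codes (treated here as excluded) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Patient:
    """Hypothetical per-patient summary derived from the EMR."""
    age_at_first_event: Optional[int]        # age at first cataract diagnosis or surgery
    age_at_last_eye_exam: Optional[int]
    years_since_last_eye_exam: Optional[float]
    had_cataract_surgery: bool
    n_diagnosis_dates: int                   # distinct dates with a cataract diagnosis code
    n_nlp_mentions: int                      # cataract mentions found by NLP


def classify(p: Patient) -> str:
    """Return 'surgical case', 'diagnosis-only case', 'control', or 'excluded'."""
    any_evidence = (p.had_cataract_surgery or p.n_diagnosis_dates > 0
                    or p.n_nlp_mentions > 0)
    if any_evidence:
        # Cases must be 50 or older at the time of diagnosis or surgery.
        if p.age_at_first_event is None or p.age_at_first_event < 50:
            return "excluded"
        if p.had_cataract_surgery:
            return "surgical case"
        if p.n_diagnosis_dates >= 2 or (p.n_diagnosis_dates == 1 and p.n_nlp_mentions >= 1):
            return "diagnosis-only case"
        return "excluded"
    # Controls: no cataract evidence, 50 or older at the most recent eye exam,
    # and an eye exam within the previous five years.
    if (p.age_at_last_eye_exam is not None and p.age_at_last_eye_exam >= 50
            and p.years_since_last_eye_exam is not None
            and p.years_since_last_eye_exam <= 5):
        return "control"
    return "excluded"
```

For example, a patient with a single diagnosis date and no NLP mention meets neither the case nor the control definition and is excluded under this sketch.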

All participant data were collected at the respective eMERGE site with appropriate patient protections and IRB protocols in place.

Genotypic Data

Genome-wide genotyping was performed on approximately 18,000 samples across the eMERGE I study sites at the Broad Institute and at the Center for Inherited Disease Research

(CIDR) using the Illumina 660W-Quad or 1M-Duo BeadChips. DNA samples from Marshfield

Clinic, Group Health/University of Washington, Mayo Clinic, and Vanderbilt University were genotyped using the Illumina 660W-Quad array. The eMERGE discovery dataset pre-quality control (QC) included 3,912 samples (1,356 controls, 2,556 cases) from the Marshfield Clinic. The pre-QC replication dataset included 2,345 samples (110 controls, 2,193 cases) from Group Health/University of Washington, 952 (346 controls, 606 cases) from Mayo Clinic, and 185 (80 controls, 105 cases) from Vanderbilt University. Added to the replication dataset were samples from eMERGE II by the Geisinger Health System, genotyped on the Illumina HumanOmniExpress (875 pre-QC samples: 221 controls, 654 cases). Due to incomplete overlap of SNPs genotyped on the Illumina 660W-Quad platform and the HumanOmniExpress platform, we used imputed data for the Geisinger samples and genotype data for all other sites. In eMERGE, genetic data are imputed to the 1000 Genomes reference panel (March 2012 release)187. Imputation for all eMERGE sites was performed on datasets separated by site and platform188 using IMPUTE2 software189 on the phased genotyped data (SHAPEIT2 was used for phasing)190. For the purpose of this study, we used hard calls derived from imputed data, where genotypes with a probability score > 0.9 were reported in PLINK191 binary files.

Data were cleaned using the eMERGE QC pipeline developed by the eMERGE

Genomics Working Group161. This process includes evaluation of sample and marker call rate, sex mismatch, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium, sample relatedness, and population stratification. For the discovery dataset, QC thresholds included: marker call rate > 99%, sample call rate > 99%, and minor allele frequency (MAF) > 5%. For the replication dataset, QC thresholds included: marker call rate > 98% and sample call rate > 99%; no MAF threshold was applied, so as to allow for testing of the highest number of quality variants in the replication analysis. After QC, 3,377 samples and 499,456 SNPs were used for discovery analysis, and 4,311 samples and 1,930 SNPs were included for replication. The

SNPs included for replication analysis were those found in significant models among the discovery dataset only. The 3,377 samples from the Marshfield PMRP included: 3,350 European

Americans, 1 African American, 8 Hispanic Americans, and 18 samples of other descent. The

4,311 samples in the replication dataset included: 2,330 samples from Group Health (2,143

European Americans, 81 African Americans, 11 Hispanic Americans, and 95 samples of other descent), 923 samples from Mayo Clinic (894 European Americans, 7 African Americans, 2

Hispanic Americans, and 20 samples of other descent), 183 samples from Vanderbilt University

(158 European Americans, 22 African Americans, and 3 samples of other descent), and 875 samples from Geisinger Health System (866 European Americans, 5 African Americans, and 4

Hispanic Americans). All genotype data and a detailed QC report for each individual study site, as well as the merged eMERGE dataset, can be found on dbGaP.

Biofilter

Biofilter was developed for high-throughput annotation, model building, and filtering of genetic data through automated access to multiple biological databases. Biofilter software is open source and freely available for non-commercial research institutions. For more information see: http://ritchielab.psu.edu/ritchielab/software/.

Biofilter accesses several publicly available biological knowledge databases through the external database compiler called the Library of Knowledge Integration (LOKI). Data sources utilized by Biofilter, and compiled through LOKI, include information about biological networks, connections, and/or pathways for determining relationships between genes. Sources compiled within LOKI include: the Kyoto Encyclopedia of Genes and Genomes (KEGG)192, Reactome193,

Gene Ontology (GO)109, protein families database194, NetPath110, Biological General Repository for Interaction Databases (BioGrid)195, and the Molecular INTeraction Database (MINT) 196.

Additionally, Biofilter maps SNPs to genes using knowledge from the National Center for Biotechnology Information (NCBI) dbSNP197198 database.

Building SNP-SNP models with Biofilter for our analyses involved several steps. First,

QC-filtered SNPs were mapped to genes using Biofilter with a 50kb window upstream and downstream of each gene to encompass potential regulatory regions close to the genes. Gene-gene pairs were established by Biofilter using knowledge from the databases within LOKI

(Figure 4.1A&B). It was required that a link between a gene-gene pair be validated in five or more separate databases (implication index ≥ 5) within LOKI for a gene-gene model to be considered for analysis. All gene-gene models were generated because of their connections to one another, independent of phenotype. Once gene-gene models were verified by five or more

databases, pairwise SNP-SNP combinations of all loci within each gene-gene model were created

(Figure 4.1C) and output for regression using PLATO software.

Figure 4.1. Steps involved in generating Biofilter SNP-SNP models. A) Biofilter accessed LOKI-compiled databases with information about connections between genes (for the example shown here: connections within a pathway). B) Biofilter generated gene-gene models based on connections between genes that were validated by 5 or more databases. C) For each gene-gene model, pairwise SNP-SNP models were created for each unique combination of loci across a gene pair.
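The model-building steps described above can be sketched in a few lines of Python. This is not Biofilter's code or interface; the gene names, source sets, rsIDs, and the `build_snp_models` helper are hypothetical placeholders, used only to show a support threshold of five sources followed by pairwise SNP-SNP expansion.

```python
from itertools import product

# Hypothetical toy inputs: the knowledge sources supporting each gene-gene link,
# and the QC-passing SNPs mapped to each gene (gene region plus flanking window).
gene_pair_sources = {
    ("GENE_A", "GENE_B"): {"KEGG", "Reactome", "GO", "NetPath", "BioGrid"},
    ("GENE_A", "GENE_C"): {"KEGG", "GO"},
}
snps_by_gene = {
    "GENE_A": ["rs1", "rs2"],
    "GENE_B": ["rs3", "rs4", "rs5"],
    "GENE_C": ["rs6"],
}


def build_snp_models(gene_pair_sources, snps_by_gene, min_sources=5):
    """Return unique SNP-SNP models for gene pairs supported by >= min_sources."""
    models = set()
    for (gene1, gene2), sources in gene_pair_sources.items():
        if len(sources) < min_sources:
            continue  # link not supported by enough independent sources
        # Every combination of one SNP from each gene forms a candidate model.
        for snp1, snp2 in product(snps_by_gene.get(gene1, []),
                                  snps_by_gene.get(gene2, [])):
            if snp1 != snp2:
                models.add(tuple(sorted((snp1, snp2))))
    return sorted(models)


models = build_snp_models(gene_pair_sources, snps_by_gene)
print(len(models), "SNP-SNP models")
```

In this toy input only the GENE_A/GENE_B pair meets the threshold, so the models are the 2 × 3 SNP combinations across that pair; note that no phenotype information enters at any step.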

Statistical Analyses

Pairwise SNP-SNP model tests of association were performed using logistic regression with PLATO assuming an additive genetic model for 259,845 Biofilter-generated SNP-SNP models in the Marshfield Clinic discovery dataset. To determine the significance of the interaction term, we performed a likelihood ratio test (LRT) between the full (Y = β0 + β1SNP1 + β2SNP2 + β3SNP1×SNP2) and reduced (Y = β0 + β1SNP1 + β2SNP2) models. We calculated principal components (PCs) using the program Eigenstrat198 to identify any potential population substructure and adjusted our analyses for the first three PCs, sex, and year of birth.
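Because the full model differs from the reduced model by a single parameter (β3), the LRT statistic can be referred to a chi-square distribution with one degree of freedom. The sketch below shows that final step only, assuming the two maximized log-likelihoods have already been obtained from fitted logistic regressions; the function name and the example log-likelihood values are hypothetical, and this is not PLATO's implementation.

```python
from math import erfc, sqrt


def lrt_p_value(loglik_reduced: float, loglik_full: float) -> float:
    """P-value of a 1-df likelihood ratio test for the SNP1 x SNP2 term."""
    statistic = 2.0 * (loglik_full - loglik_reduced)
    statistic = max(statistic, 0.0)  # guard against tiny negative values from rounding
    # For 1 degree of freedom, the chi-square survival function equals erfc(sqrt(x / 2)).
    return erfc(sqrt(statistic / 2.0))


# Hypothetical log-likelihoods from the reduced and full models of one SNP pair.
p = lrt_p_value(loglik_reduced=-1205.0, loglik_full=-1201.5)
print(f"LRT p-value: {p:.4f}")
```

Covariates such as the PCs, sex, and year of birth would appear in both models and therefore do not change the degrees of freedom of the test.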

The independent replication dataset included samples from Group Health/University of

Washington, Vanderbilt University, Mayo Clinic, and Geisinger Health System. We targeted an initial set of 2,452 SNP-SNP models that passed a LRT p-value threshold of p<0.01 in the

Marshfield discovery dataset for replication. There were 2,149 unique SNPs in the 2,452 discovery-significant SNP-SNP models. After applying a QC filter (marker call rate: 98%) in the

replication set, 1,930 SNPs remained, and thus, 2,092 SNP-SNP models (of the 2,452 discovery-significant models) were available for testing in the replication dataset. All methods and adjustments used for the discovery dataset were applied for the replication analyses in addition to adjusting for study site. A flowchart of the study design is displayed in Figure 4.2.

Permutation tests were performed for all 2,092 SNP-SNP models in the replication dataset. For permutation, the phenotype was randomly shuffled 1,000 times, and an LRT p-value was calculated for each model in each of the 1,000 permutations. The permuted p-value for each SNP-SNP model was determined as the fraction of times a permuted p-value was lower than the LRT p-value derived from the natural phenotype.
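The permutation procedure can be illustrated with the following minimal Python sketch. It assumes a per-model reading of the description above, in which each model's observed LRT p-value is compared with that same model's permuted p-values; `model_p_value` is a hypothetical callable standing in for refitting the full and reduced models on a given phenotype vector and returning the LRT p-value.

```python
import random
from typing import Callable, Sequence


def permuted_p_value(
    phenotype: Sequence[int],
    model_p_value: Callable[[Sequence[int]], float],
    n_permutations: int = 1000,
    seed: int = 0,
) -> float:
    """Fraction of phenotype shuffles giving a lower p-value than the observed one."""
    rng = random.Random(seed)              # fixed seed so the sketch is reproducible
    observed = model_p_value(phenotype)    # LRT p-value with the unshuffled phenotype
    shuffled = list(phenotype)
    n_lower = 0
    for _ in range(n_permutations):
        rng.shuffle(shuffled)              # breaks any genotype-phenotype association
        if model_p_value(shuffled) < observed:
            n_lower += 1
    return n_lower / n_permutations
```

Shuffling only the phenotype preserves the correlation structure between the two SNPs and the covariates, so the permuted p-values approximate the null distribution for that specific model.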

Figure 4.2. Flow chart of steps in the discovery and replication analyses.

Results

In the discovery dataset, 2,452 SNP-SNP models were significant with a LRT p-value < 0.01, and these were tested in the independent replication dataset. Of these, 83 models were significant in the replication dataset with a LRT p-value < 0.05 (Appendix Table 4.1). Additionally, for all of the 83 models, the permuted p-value was ≤ 0.05. There were 22 SNP-SNP models that replicated with an LRT p-value < 0.01 (Figure 4.3).

Figure 4.3. All replicating SNP-SNP models with LRT p < 0.01 in both the replication and discovery datasets. SNP-SNP models are shown above with the –log10 of the p-value in the track directly beneath (discovery values are in blue and replication values are in red).

Thirteen replicating models were significant with a LRT p-value < 0.001 in the discovery sample, and three were significant with a LRT p-value < 0.001 in the replication sample. Figure 4.4 shows the replicating SNP-SNP models with the 10 lowest LRT p-values for the discovery (Figure 4.4A) and replication (Figure 4.4B) datasets.

Figure 4.4. Ten most significant replicating SNP-SNP models, ranked by significance level in the discovery (A) and replication (B) samples. For both figures, the SNP-SNP models and their nearest genes are listed to the left. The track to the right of each displays the –log10 of the p-value for the discovery (blue) and replication (red) groups. Figures were made using Synthesis View.

The SNPs within the model with the lowest LRT p-value in the discovery group were rs2303436 (a missense SNP in DLAT) and rs9811074 (near PDHB) (discovery LRT-P = 2.9×10⁻⁴, replication LRT-P = 0.013) (Figure 4.4A). Other significant models in the discovery group included intronic SNP rs9320004 (KIAA1468) and rs527459 (542bp 3’ of PIGO) as well as rs10789856 (intron of DIXDC1) and rs9811074 (near PDHB).

The replicating SNP-SNP model with the lowest LRT p-value in the replication sample was rs1011173 (intron of ACSBG1) and rs6037336 (near EBF4) (discovery LRT-P = 0.0031, replication LRT-P = 3.9×10⁻⁴) (Figure 4.4B). Other top SNP-SNP models were rs4333645 (near TMEM249) and rs2025072 (intron of CPSF2) as well as rs12597188 and rs11564445 in CDH1 and CTNNB1, respectively.

In order to identify common function across genes in all replicating models, SNPs in every replicating SNP-SNP model were mapped to their closest gene. Ninety unique genes were identified as harboring SNPs in the 83 SNP-SNP models. These 90 genes were subsequently annotated with all group information (such as pathway) using Biofilter. Groups linked to the largest number of genes included: signal transduction (16 genes), adaptive immune system (12 genes), pathways in cancer (12 genes), innate immune response (11 genes), apoptosis (10 genes),

DNA replication (10 genes), extracellular vesicular exosome (9 genes), microRNAs in cancer (9 genes), positive regulation of cell proliferation (9 genes), proteoglycans in cancer (9 genes), PI3K-Akt signaling pathway (8 genes), EGFR signaling pathway (8 genes), and focal adhesion (8 genes). Figure 4.5 shows two examples of these common groups and the genes with which they were annotated.

Figure 4.5. Common groups relating to genes in replicating SNP-SNP models. Figures display two of the most common groups (yellow) and the genes that Biofilter annotated with that group (blue): A) signal transduction and B) PI3K-Akt signaling pathway. Solid lines indicate group-gene connection, dotted line indicates gene-gene connection from the interaction analysis. Plots were generated using Cytoscape199 software.

Finally, we used the Tissue-specific Gene Expression and Regulation (TiGER) database200,201 to determine how many of the genes in our models are expressed in the eye.

Though cataracts develop in the lens specifically, no comprehensive analysis of gene expression in the lens has been published, to the authors’ knowledge, so we focused on gene expression in the eye. We found that, of the identified 90 genes, 61 (~68%) are expressed in the human eye. This is a far greater percentage than that of all genes analyzed for the compilation of the database (~20,000) that are expressed in the eye (289), which is ~1%. Thus, we see a greater proportion of genes expressed in the eye represented in our final set of genes.

Discussion

In this first replication study of gene-gene interactions associated with age-related cataract, we found 83 SNP-SNP models that replicated across two independent datasets with a

likelihood ratio test p-value less than 0.01 in the discovery sample and 0.05 in the replication sample. Many of the replicating SNP-SNP models were in or near genes expressed in the eye and/or relating to lens development and maintenance as well as the development of cataracts, as further described below.

The anterior surface of the lens is made up of a single layer of epithelial cells that divide and differentiate throughout life into fiber cells, which make up the largest part of the lens202. As differentiation occurs, fiber cells experience unique loss of the nucleus and other organelles in addition to high expression of crystallin proteins, both of which are essential for transparency of the lens as well as a high refractive index203. Because the lens is an avascular tissue and fiber cells lack organelles, cell-to-cell junctions, both among fiber cells as well as between fiber cells and lens epithelial cells, are crucial for cell maintenance and survival including nutrient delivery and metabolic waste removal204. Both gap junctions and adherens junctions are present in lens cells205. Studies have shown that mutations in genes encoding gap junction connexin (Cx) proteins have led to cataracts in mice206,207 and are associated with cataract development in humans208,209. Adherens junctions and their components, classical cadherins and interacting β-catenin, play a crucial part in lens development and maintenance as well179,205,210.

Among the replicating models with the lowest LRT p-values in the replication dataset was a model of two intronic SNPs, rs12597188 and rs11564445, which are in cadherin 1, type 1, E-cadherin (CDH1) and catenin (cadherin-associated protein) beta 1 (CTNNB1), respectively (Figure 4.4B).

CDH1 encodes E-cadherin, a calcium-dependent glycoprotein that maintains epithelial cell-cell adhesion at adherens junctions211, and CTNNB1 encodes β-catenin, which acts as an anchor protein for E-cadherin so as to maintain a connection to intracellular actin. In addition to its role in adherens junction formation, β-catenin also has known signaling functions179. β-catenin has been shown to translocate to the nucleus and activate transcription in complex with lymphoid enhancer-binding/T cell factor (LEF/TCF) in response to Wnt signaling212. The Wnt/β-catenin pathway is known for regulating cell proliferation, differentiation, as well as migration213. Normal

Wnt/β-catenin signaling is thought to be essential in the formation and maintenance of the lens epithelium179. The pathway’s response to transforming growth factor beta (TGFβ) induction has been implicated in epithelial-mesenchymal transition (EMT)181, an event that has been shown to lead to posterior capsular opacification (PCO), also known as secondary cataracts, in humans214,215. This process includes loss of cell polarity and cell-cell adhesion, which involves down-regulation of E-cadherin, transcriptional reprogramming, and migration.

Another pathway involved in induction of EMT by TGFβ is the phosphatidylinositol-3-kinase (PI3K)/Akt pathway, which has demonstrated importance in down-regulation of connexin-43216. We found eight genes that harbor the replicating SNP-SNP models that were annotated with the PI3K/Akt pathway group (Figure 4.5B). These results reinforce previous findings on the importance of typical function of E-cadherin, β-catenin, and the PI3K/Akt pathway in lens maintenance.

Additional growth factors are crucial for lens development and maintenance. The aqueous humour provides lens cells with growth factors including FGF, IGF, PDGF, and EGF, and these are important for lens structure and polarity179. Further, it is thought that these factors regulate cell proliferation via the MAPK/Erk and PI3K/Akt pathways. “Signal Transduction” was among the two most common groups, with sixteen genes relating to it (Figure 4.5A). Signal transduction can be considered a somewhat generic group into which a large number of proteins fall. Nonetheless, the genes found to relate to this group in our study are involved in specific transduction events known to be related to cataract. Genes in gene-gene models found to relate to signal transduction here included NOTCH1, which has demonstrated involvement in lens development217, and NOTCH4, as well as EGF and EGFR. The intronic SNPs of EGF and EGFR, rs3796947 and rs6954351, respectively, were among the five most significant replicating models in the discovery dataset. Epidermal Growth Factor (EGF) encodes a mitogenic factor that acts by binding to the epidermal growth factor receptor, encoded by Epidermal Growth Factor Receptor (EGFR). EGF and EGFR are part of both the MAPK/Erk and PI3K/Akt signaling pathways. Both factors are

important for epithelial cell proliferation, and previous findings have demonstrated that EGFR RNAi treatment suppresses proliferation of lens epithelial cells following cataract surgery in rats218.

Some limitations to our method may have decreased our ability to identify additional genetic interactions predictive of age-related cataract. The current application of Biofilter focuses on building models from protein-coding gene regions. Future additions to the software, including incorporation of regulatory regions, will allow identification of loci that fall outside of the 50kb gene window that may still be involved in the expression of a trait. The challenge of genetic heterogeneity has yet to be addressed with this method as well. If there are multiple disease loci spread across subsets of cases, we would have had little power to detect them. Methods for binning variants in genes and/or pathways may increase our ability to identify more genetic interactions. Additionally, the current approach considered genetic variation without allowing for interactions with the environment. Incorporating exposure data with this approach will further elucidate the complex underpinnings of age-related cataract.

The results described in this study are consistent with previous findings relating to lens cell maintenance and structure as well as cataract development. Use of Biofilter decreased the search space to identify and replicate putative SNP-SNP combinations. These results demonstrate the role of genetic interactions in the development of complex phenotypes like age-related cataract. Other genetic epidemiology studies would benefit from the annotation, filtering, and model building functions of Biofilter.

Funding Acknowledgements

This study was supported by the following U01 grants from the National Human Genome

Research Institute (NHGRI), a component of the National Institutes of Health (NIH), Bethesda,

MD, USA: U01HG004608 (Marshfield Clinic); U01HG006389 (Essentia Institute of Rural

Health & Marshfield Clinic Research Foundation); U01HG004610 and U01HG006375 (Group

Health/University of Washington); U01HG04599 and U01HG006379 (Mayo Clinic);

U01HG04603 and U01HG006378 (Vanderbilt University); U01HG006385 (Vanderbilt

University); U01HG004438 (CIDR, Johns Hopkins University); U01HG004424 (The Broad

Institute); and U01HG006382 (Geisinger Health System). Additional support was provided by

UO1AG06781 and HG006828. We would like to thank Dr. Dokyoon Kim for his assistance in generating Figure 4.1.

CHAPTER 5

DATA-DRIVEN WEIGHTED ENCODING: A ROBUST APPROACH FOR

DETECTING DIVERSE GENETIC ACTION

Abstract

There are certain assumptions made about the biological action of a SNP when choosing a traditional genetic encoding. For each encoding, risk incurred by one copy of the alternate allele in relation to two copies varies: the heterozygous genotype is coded to bear 0%, 50%, and 100% the risk of homozygous alternate for recessive, additive, and dominant encodings, respectively.

However, the heterozygous action for a given SNP may incur any portion of the risk of the homozygous alternate genotype. Further, SNPs across the genome are unlikely to demonstrate the same genetic action. Choosing just one encoding, therefore, is not flexible to the diversity of action in biology, while running every encoding raises the multiple testing burden. We present a novel, robust alternative: the data-driven genetic encoding. Here, a heterozygous value is assigned, based on the action a SNP demonstrates in the dataset, thereby providing an individualized encoding for every SNP. We compared power of this method to detect main effect and genetic interactions using a comprehensive combination of simulated genetic models and found the novel method to outperform the traditional methods for the highest number of underlying models across varying minor allele frequencies (MAFs). We additionally compared the method to the codominant encoding and found that for situations in which the codominant encoding loses power (for low MAF) our method retained sufficient power. We further tested our method with null and main effect-only simulated datasets and found that our method maintained a low false positive rate for likelihood ratio test (LRT) p-value, while the additive and dominant encodings demonstrated inflation. Finally, we applied our method to natural data from the

Electronic Medical Records and Genomics (eMERGE) Network using age-related macular degeneration (AMD) cases (2,167) and controls (10,986). We identified 438 SNP-SNP models that were significant when allowing for a 5% false discovery rate. One top model included genes that encode CFH and C2, both with previously reported involvement in AMD. Our results demonstrate that caution is needed when using the additive and dominant encodings for interactions due to the inflation we observed. We demonstrate the utility of our novel encoding

method at flexibly assigning action to individual SNPs, identifying interactions between SNPs with diverse action, and uncovering examples of SNP-SNP interaction in GWAS data.

Introduction

Choosing between the traditional methods for coding genotypes in association studies, including additive, dominant, and recessive, requires making an assumption about the manner in which the risk allele acts. Given referent allele, A, and alternate allele, a, all encodings assume that the AA

(homozygous referent) genotype incurs no risk and aa (homozygous alternate) bears full risk. The heterozygous (Aa) coding varies, however, according to the chosen method. For each encoding, risk incurred by one copy of the risk allele in relation to two copies varies: the heterozygous genotype is coded to bear 0%, 50%, and 100% the risk of homozygous alternate for recessive, additive, and dominant encodings, respectively. Yet, heterozygous action can, theoretically, fall anywhere from 0% to 100% the risk of the homozygous alternate genotype (for examples of possible genotype action, see Table 5.1). Additionally, each locus across the genome is likely to display different biological action; therefore, choosing one encoding is restrictive. Yet, testing all types of action increases the multiple testing burden, thereby limiting the ability to identify true signals. An alternative is to use the codominant encoding, which makes no assumptions about the biological action of a marker; however, non-convergence is an issue when using this method for interactions in binary traits, especially for lower-frequency variants.

Table 5.1. Examples of possible proportional genotype risk underlying genetic loci.

Biological Action    Homozygous Referent (AA)    Heterozygous (Aa)    Homozygous Alternate (aa)
Recessive            0%                          0%                   100%
Sub-Additive         0%                          25%                  100%
Additive             0%                          50%                  100%
Super-Additive       0%                          75%                  100%
Dominant             0%                          100%                 100%
Heterozygous         0%                          100%                 0%

A common approach to handle this challenge is to employ the additive encoding in the hopes that the majority of the genetic effects will be detected by this method alone. It follows that the results identified with the additive encoding are expected to be highly correlated with those found from the other encodings. Yet, little has been done by way of testing this assumption at the interaction level. We assessed the correlations between the additive, dominant, recessive, and codominant encodings in two natural GWAS datasets for main effect and interaction likelihood ratio test (LRT) p-values, and found that, not only does the type of encoding used impact the results at the main effect level, the correlation between results obtained using each encoding is much lower for interactions. We have developed an alternative method to flexibly assign an encoding to each single nucleotide polymorphism (SNP) based on the effect it displays in the dataset: the data-driven weighted encoding. In this chapter, we describe the results of the weighted encoding in comparison to the traditional methods when applied to simulated data across a diverse set of genetic underlying models, minor allele frequencies, sample sizes, and case-control ratios. We show that the weighted encoding is flexible and demonstrates robust power across diverse sets of parameters and low type 1 error in null data when compared to the

traditional approaches. Finally, we applied the weighted encoding to SNP-SNP interactions using a natural dataset with age-related macular degeneration (AMD) cases (2,167) and controls (10,986) from the Electronic Medical Records and Genomics (eMERGE) Network.

This method seeks to shift the current paradigm by allowing detection of genetic interaction models with a diverse set of biological actions without employing every encoding.

Instead of assuming the action of each locus, the weighted encoding is flexible to each SNP’s demonstrated action. We recommend using the data-driven weighted encoding as an alternative to the additive, dominant, recessive, and codominant encodings to uncover the genetic etiology of common, complex traits.

Methods

Data-Driven Weighted Encoding

Like traditional encodings, the weighted method codes the homozygous referent and homozygous alternate genotypes as 0 and 1, respectively. For additive, dominant, and recessive methods, a predetermined heterozygous action is assumed (Table 5.2). Conversely, instead of providing a fixed heterozygous value, the weighted encoding assigns a flexible heterozygous value (α) based on the effect each variant displays in the dataset (Table 5.2). For example, if SNP A’s underlying biology is additive, the weighted encoding will assign the heterozygous genotype as 0.5, and if SNP B demonstrates a sub-additive underlying model, the heterozygote will be coded as 0.25. The heterozygous coding assignment can be any continuous value from 0 to 1 if the underlying model is between recessive and dominant. Further, the weighted encoding is designed to allow for a heterozygous value outside of 0 and 1 in the case that the SNP acts in a manner that is “over-dominant” (>1) or “under-recessive” (<0). This coding detection and assignment is performed prior to interaction analyses, allowing for SNP-SNP interaction coding that reflects the underlying action of each individual SNP.

Table 5.2. Recessive, additive, dominant, and weighted encoding schemes.

Biological Action   Homozygous Referent   Heterozygous   Homozygous Alternate
Recessive           0                     0              1
Additive            0                     0.50           1
Dominant            0                     1              1
Weighted            0                     α              1
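The encoding schemes in Table 5.2 can be illustrated with a short sketch. This is an illustration only and is not taken from the analysis software; the function name `encode_genotype` and the representation of a genotype as a count of alternate alleles (0, 1, or 2) are assumptions made for the example. The value of α is supplied by the caller, because the equations that derive it (Figure 5.1) are not reproduced here.

```python
# Heterozygous codes for the fixed encodings in Table 5.2.
FIXED_HETEROZYGOUS_CODE = {"recessive": 0.0, "additive": 0.5, "dominant": 1.0}


def encode_genotype(alt_allele_count, scheme, alpha=None):
    """Return the coded value for a genotype given as 0, 1, or 2 alternate alleles.

    Homozygous referent is always 0 and homozygous alternate is always 1.
    Only the heterozygous code depends on the scheme. For the weighted
    scheme, `alpha` is the data-driven heterozygous value and may fall
    outside [0, 1] (over-dominant or under-recessive action).
    """
    if alt_allele_count not in (0, 1, 2):
        raise ValueError("alt_allele_count must be 0, 1, or 2")
    if scheme == "weighted":
        if alpha is None:
            raise ValueError("the weighted scheme requires an alpha value")
        heterozygous = float(alpha)
    else:
        heterozygous = FIXED_HETEROZYGOUS_CODE[scheme]
    return (0.0, heterozygous, 1.0)[alt_allele_count]
```

For example, `encode_genotype(1, "weighted", alpha=0.25)` codes a heterozygote at a sub-additive SNP as 0.25, while the same genotype under the dominant scheme is coded as 1.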

With the weighted method, each locus has a unique encoding, and this allows interacting SNPs to have different biological actions. The weighted heterozygous value is calculated in the following steps:

1. Recessive and dominant models are tested for each locus using logistic or linear regression and correcting for the number of tests (Figure 5.1A).

2. Using the beta from the recessive (R) and the dominant (D) encodings, the weighted value (α) for the heterozygous genotype is calculated (Figure 5.1B).

3. These weighted encodings (homozygous referent = 0, heterozygous = α, homozygous alternate = 1) are used for comprehensive SNP-SNP and/or SNP-exposure interaction analyses.


Figure 5.1. Equations used to assign a data-driven weighted value for the heterozygous genotype. A) Recessive and dominant encodings are tested in a regression model. B) The heterozygous genotype value (α) is calculated using the effects of the dominant (D) and recessive (R) encodings.

Simulated Datasets

The simulation script generates two independent, biallelic SNPs in Hardy-Weinberg equilibrium according to given minor allele frequencies. With the genotypes generated, a phenotype is determined using a traditional 9-cell penetrance table. The simulation script can generate both quantitative and dichotomous phenotypes; in the case of quantitative phenotypes, the calculated dependent variable is a random sample from a normal distribution with variance 1 and a mean given by the penetrance table. In the case of dichotomous phenotypes, each cell in the penetrance table represents the probability of being affected, and rejection sampling is used to generate the requested number of cases and controls.
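A minimal sketch of this simulation procedure follows. It is not the simulation script used in this work; the function names, the 3x3 nested-list layout of the penetrance table (indexed by the number of minor alleles at each SNP), and the seeding scheme are assumptions made for illustration.

```python
import random


def hwe_genotype_probs(maf):
    """Genotype probabilities (0, 1, 2 minor alleles) under Hardy-Weinberg equilibrium."""
    major = 1.0 - maf
    return [major * major, 2.0 * major * maf, maf * maf]


def draw_genotype_pair(rng, maf1, maf2):
    """Draw two independent genotypes, each coded as a count of minor alleles."""
    g1 = rng.choices([0, 1, 2], weights=hwe_genotype_probs(maf1))[0]
    g2 = rng.choices([0, 1, 2], weights=hwe_genotype_probs(maf2))[0]
    return g1, g2


def simulate_quantitative(n, maf1, maf2, penetrance, seed=0):
    """Quantitative phenotype: normal draw with variance 1 and mean from the table."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        g1, g2 = draw_genotype_pair(rng, maf1, maf2)
        rows.append((g1, g2, rng.gauss(penetrance[g1][g2], 1.0)))
    return rows


def simulate_case_control(n_cases, n_controls, maf1, maf2, penetrance, seed=0):
    """Dichotomous phenotype by rejection sampling.

    Each cell of `penetrance` is the probability of being affected.
    Individuals are drawn until both quotas are filled; a draw is
    discarded when its status group is already full. This loop does not
    terminate if every cell is exactly 0 or exactly 1.
    """
    rng = random.Random(seed)
    cases, controls = [], []
    while len(cases) < n_cases or len(controls) < n_controls:
        g1, g2 = draw_genotype_pair(rng, maf1, maf2)
        affected = rng.random() < penetrance[g1][g2]
        if affected and len(cases) < n_cases:
            cases.append((g1, g2, 1))
        elif not affected and len(controls) < n_controls:
            controls.append((g1, g2, 0))
    return cases + controls
```

Rejection sampling is what allows a fixed case-control ratio to be requested regardless of the prevalence implied by the penetrance table.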

The penetrance table used in the simulation script may either be provided directly, or it can be generated according to the model: In the previous formula, Χi represents the given normalized effect for each genotype for SNP i. As an example, “0,0,1” represents a recessive effect, while “0,0.5,1” represents an additive effect.

In order to normalize the amount of detectable effect in the interaction term across different models, the simulation script also offers an option to automatically scale the penetrance table according to the amount of detectable signal of the interaction term using a codominant encoding. To do this, we construct a 9-row regression with each row corresponding to a cell in the penetrance table. The model matrix is defined to be the intercept term along with the codominant encoding of both SNPs, yielding 5 columns (note: any linearly independent encoding will work for our purposes; we chose codominant for ease of understanding). The dependent variable is the normalized value of the penetrance table. Normalization in this case does not change the result, but it does allow for the same starting point when applying the adjustment factor. Any dependent variable that is an affine transformation (multiply by a constant, then add a constant) of the given penetrance values will yield identical variance explained. The proof of this is left to the reader, but it hinges on the fact that the residuals of the regression vary by a constant factor, which cancels out when finding the standard deviation. An example is shown below:


Now, because we only have nine rows above, we can see that including the interaction terms between the two SNPs will cause the regression to fit the data perfectly, as there will be nine columns of independent variables (1 intercept, 2 columns of main effects for each of the 2 SNPs, and 4 columns of interaction effects). Thus, any unexplained variance in the above model is due entirely to the missing interaction terms.

To account for the impact that minor allele frequency can have on the amount of signal, we perform a weighted linear regression on the above data, where the weights are defined by the probability of an individual having a given genotype under the assumption of two independent SNPs in Hardy-Weinberg equilibrium in a population of infinite size. The amount of “signal” available to find is determined by the weighted residual standard error. In the case of a perfectly predictive regression, as would be the case in a model with only main effects, this process is repeated with the model matrix consisting of only the intercept term. The weighted residual standard error is defined as:

where r_i are the residuals and w_i are the weights, as defined previously.
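The weighted regression on the 9-cell penetrance table can be sketched as follows. This is an illustration under stated assumptions and is not the simulation script itself: the function names are hypothetical, and the summary statistic returned here, the square root of the weight-averaged squared residuals (with weights summing to 1), is an assumed stand-in for the weighted residual standard error, which may be normalized differently in the original definition.

```python
import math


def hwe_probs(maf):
    """Genotype probabilities (0, 1, 2 minor alleles) under Hardy-Weinberg equilibrium."""
    major = 1.0 - maf
    return [major * major, 2.0 * major * maf, maf * maf]


def solve_linear_system(matrix, rhs):
    """Solve matrix @ x = rhs by Gauss-Jordan elimination with partial pivoting."""
    n = len(rhs)
    aug = [row[:] + [rhs[i]] for i, row in enumerate(matrix)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(n):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    return [aug[i][n] / aug[i][i] for i in range(n)]


def main_effect_residual_signal(penetrance, maf1, maf2):
    """Weighted fit of main effects only; returns the weighted residual size.

    The model matrix has 5 columns: intercept plus codominant (indicator)
    columns for the heterozygous and homozygous alternate genotypes of
    each SNP. Whatever the fit cannot explain is interaction signal.
    """
    p1, p2 = hwe_probs(maf1), hwe_probs(maf2)
    design, y, w = [], [], []
    for g1 in range(3):
        for g2 in range(3):
            design.append([1.0, float(g1 == 1), float(g1 == 2),
                           float(g2 == 1), float(g2 == 2)])
            y.append(float(penetrance[g1][g2]))
            w.append(p1[g1] * p2[g2])  # joint probability of the genotype pair
    k = 5
    xtwx = [[sum(w[i] * design[i][a] * design[i][b] for i in range(9))
             for b in range(k)] for a in range(k)]
    xtwy = [sum(w[i] * design[i][a] * y[i] for i in range(9)) for a in range(k)]
    beta = solve_linear_system(xtwx, xtwy)
    residuals = [y[i] - sum(beta[a] * design[i][a] for a in range(k))
                 for i in range(9)]
    return math.sqrt(sum(w[i] * residuals[i] ** 2 for i in range(9)))
```

A table built purely from main effects returns a value of essentially zero, whereas an interaction-only table such as XOR leaves a clearly non-zero residual, which is the quantity the scaling step standardizes.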

In the case of simulating a quantitative phenotype, the new penetrance table is defined to be the normalized penetrance table times for some configurable value k, defined to be the “signal to noise ratio.”

In the case of simulating a dichotomous phenotype, we want the probabilities calculated such that the odds ratio between the least penetrant cell and the most penetrant cell is our expression as above. Additionally, we want the least penetrant cell and the most penetrant cell to be equally close to 0 and 1, respectively, with all of the probabilities scaling linearly. We define the minimum probability as:

, and then the new penetrance table is now defined as:

We simulated interactions with no main effect between SNPs with comprehensive combinations of underlying genetic models, including all two-way combinations between the following models: additive (ADD), dominant (DOM), recessive (REC), sub-additive (SUB), super-additive (SUP), and heterozygous (HET). These are henceforth referred to as “traditional” interaction models. Additionally, we evaluated simulated interactions using “genotype-based” interaction models that include penetrance functions (e.g., XOR, Hyperbolic) and situations in which only one of the 9 interaction penetrance cells confers risk while the other 8 demonstrate no risk (e.g., the Homozygous Referent-Homozygous Referent cell) (Table 5.3 lists all genotype-based models). Our interaction analysis involved simulation of 29 interaction-only models with no main effect, and we initially simulated these datasets with a MAF of 30%, a sample size of 2,000, and a case-control ratio of 1:1. Additionally, we simulated a baseline risk of 25% and a maximum risk of 50% to reduce the number of simulations with complete separation. Finally, to assess the impact of the aforementioned parameters, we performed a parameter sweep by simulating our models with all combinations of the following: MAF: 0.05, 0.1, 0.3, 0.5; sample size: 2,000, 10,000, and 20,000; case-control ratio: 1:1, 1:3, and 3:1; and varying baseline to maximum risks.

Table 5.3. Genotype-based models.

        HR-HR   HR-HET   HR-HA   HET-HET   HET-HA   HA-HA   XOR   Hyp   RHyp
AABB    1       0        0       0         0        0       1     0     1
AABb    0       1        0       0         0        0       0     0.5   0.5
AAbb    0       0        1       0         0        0       1     1     0
AaBB    0       0        0       0         0        0       0     0.5   0.5
AaBb    0       0        0       1         0        0       1     0.5   0.5
Aabb    0       0        0       0         1        0       0     0.5   0.5
aaBB    0       0        0       0         0        0       1     1     0
aaBb    0       0        0       0         0        0       0     0.5   0.5
aabb    0       0        0       0         0        1       1     0     1

HR: Homozygous Referent
HET: Heterozygous
HA: Homozygous Alternate
XOR: XOR Model
Hyp: Hyperbolic Model
RHyp: Hyperbolic Model
X: Risk genotype combination
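To make the link between a normalized model in Table 5.3 and the simulated risks concrete, the sketch below rescales a 0-to-1 table so that its smallest cell equals the baseline risk and its largest cell equals the maximum risk. The linear mapping, the function name `scale_to_risk`, and the nested-list layout (rows AA, Aa, aa; columns BB, Bb, bb) are assumptions for illustration and are not confirmed as the exact rescaling used by the simulation script.

```python
# Three of the genotype-based models from Table 5.3, laid out as
# rows = SNP A genotype (AA, Aa, aa), columns = SNP B genotype (BB, Bb, bb).
GENOTYPE_MODELS = {
    "XOR": [[1, 0, 1], [0, 1, 0], [1, 0, 1]],
    "Hyp": [[0, 0.5, 1], [0.5, 0.5, 0.5], [1, 0.5, 0]],
    "RHyp": [[1, 0.5, 0], [0.5, 0.5, 0.5], [0, 0.5, 1]],
}


def scale_to_risk(table, baseline=0.25, maximum=0.50):
    """Linearly map table values onto the interval [baseline, maximum]."""
    cells = [value for row in table for value in row]
    low, high = min(cells), max(cells)
    if high == low:
        raise ValueError("table has no variation to scale")
    span = maximum - baseline
    return [[baseline + span * (value - low) / (high - low) for value in row]
            for row in table]
```

With the default arguments, the XOR model yields a risk of 0.50 in its risk cells and 0.25 elsewhere, and the intermediate cells of the hyperbolic model fall at 0.375.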

To test for type I error, we simulated 1,000 null signal datasets. In addition, to assess inflation of each encoding under models with main effect signal without interaction signal, we simulated a number of datasets under two scenarios. The first involved simulating two SNPs, both exhibiting a main effect but having no interaction effect (“Two-SNP Main Effect”). The second involved simulating two SNPs in which only one SNP exhibited a main effect and the other did not and with no interaction effect (“One-SNP Main Effect”). For the Two-SNP Main Effect models, we simulated underlying biological model combinations between REC, SUB, ADD, SUP, DOM, and HET. For the One-SNP Main Effect models, we simulated datasets allowing for one SNP to exhibit each of the above biological models.

Natural Datasets

Genome-wide genotyping was performed on approximately 18,000 samples across the eMERGE I study sites at the Broad Institute and at the Center for Inherited Disease Research (CIDR) using the Illumina 660W-Quad or 1M-Duo BeadChips. In eMERGE, genetic data were imputed to the 1000 Genomes reference panel (March 2012 release). Imputation for all eMERGE sites was performed on datasets separated by site and platform using IMPUTE2 on the phased genotype data (SHAPEIT2 was used for phasing). Data were cleaned using the eMERGE QC pipeline developed by the eMERGE Genomics Working Group. This process includes evaluation of sample and marker call rate, sex mismatch, duplicate and HapMap concordance, batch effects, Hardy-Weinberg equilibrium, sample relatedness, and population stratification. Correlation analyses of main effect and interaction results were performed for age-related cataracts (1,185 controls and 2,192 cases) using non-imputed genotype data. We applied our weighted encoding method to natural data using age-related macular degeneration (AMD) cases (2,167) and controls (10,986) with imputed data. Study sites from eMERGE included in the AMD analysis were Marshfield Clinic, Group Health, Geisinger, Northwestern, and the Mayo Clinic.

A total of 15,737 samples were included for correlation assessment of body mass index (BMI) from combined datasets including the Atherosclerosis Risk in Communities (ARIC) Study, Multi-Ethnic Study of Atherosclerosis (MESA), Cardiovascular Health Study (CHS), Framingham Heart Study (FHS), and Coronary Artery Risk Development in Young Adults (CARDIA) Study. Each of these datasets was obtained in collaboration with the Children’s Hospital of Pennsylvania.

Statistical Analyses

To assess the correlation between the main effect results obtained from additive, dominant, recessive, and codominant encodings, we considered two phenotypes to test linear and logistic regression. For linear regression, we tested SNPs for main effect association with BMI (adjusting for age, sex, a high effect SNP in FTO, and the first 3 principal components). For logistic regression, we assessed SNPs for association with age-related cataract from the Marshfield Clinic (adjusting for age, sex, and 3 principal components). For both phenotypes, we ran associations with all four encodings, calculated Pearson’s correlations, and plotted correlation matrices. For the interaction correlations, we applied filtering to reduce the number of pairwise tests. We employed a number of filtering methods so as not to bias the results toward one approach: 1) strict LD pruning (r²=0.1); 2) running associations with the remaining SNPs using all four encodings using regression; 3) selecting the 500 results with the lowest p-values for each encoding; 4) selecting SNPs likely to be biologically related using Biofilter software; and 5) randomly selecting SNPs using the “thinning” feature in PLINK191. We then tested pairwise combinations of interactions between each SNP and derived a likelihood ratio test (LRT) p-value by comparing the full and reduced models (as described below). Correlations between the LRT p-values for each encoding were calculated using the same method as described for main effect.

For all simulated and natural datasets, regression analyses were performed using PLATO software (Hall et al., in preparation). To determine the significance of a SNP-SNP interaction model above and beyond the main effects of both SNPs combined, we performed a likelihood ratio test (LRT) between the full (Y = β0 + β1SNP1 + β2SNP2 + β3SNP1×SNP2) and reduced (Y = β0 + β1SNP1 + β2SNP2) models and derived an LRT p-value.
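The comparison of full and reduced models can be sketched for the quantitative (linear regression) case as below. This is not PLATO and omits covariates; the function names are hypothetical. It assumes Gaussian errors, under which the likelihood ratio statistic equals n·ln(RSS_reduced / RSS_full), and it uses the fact that the upper tail of a chi-square distribution with 1 degree of freedom is erfc(√(x/2)).

```python
import math


def ols_rss(predictors, y):
    """Residual sum of squares from least squares of y on the predictors plus an intercept."""
    n = len(y)
    x = [[1.0] + [col[i] for col in predictors] for i in range(n)]
    k = len(x[0])
    # Augmented normal equations [X'X | X'y], solved by Gauss-Jordan elimination.
    aug = [[sum(x[i][a] * x[i][b] for i in range(n)) for b in range(k)]
           + [sum(x[i][a] * y[i] for i in range(n))] for a in range(k)]
    for col in range(k):
        pivot = max(range(col, k), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        for r in range(k):
            if r != col:
                factor = aug[r][col] / aug[col][col]
                aug[r] = [vr - factor * vc for vr, vc in zip(aug[r], aug[col])]
    beta = [aug[a][k] / aug[a][a] for a in range(k)]
    return sum((y[i] - sum(b * v for b, v in zip(beta, x[i]))) ** 2
               for i in range(n))


def interaction_lrt_pvalue(snp1, snp2, y):
    """LRT p-value for the SNP1 x SNP2 term, given already-encoded SNP values."""
    product = [a * b for a, b in zip(snp1, snp2)]
    rss_reduced = ols_rss([snp1, snp2], y)          # Y = b0 + b1*SNP1 + b2*SNP2
    rss_full = ols_rss([snp1, snp2, product], y)    # ... + b3*SNP1*SNP2
    statistic = len(y) * math.log(rss_reduced / rss_full)
    # One extra parameter in the full model, so 1 degree of freedom.
    return math.erfc(math.sqrt(max(statistic, 0.0) / 2.0))
```

Because the SNP values are passed in already encoded, the same function applies to additive, dominant, recessive, or weighted codings; a codominant coding would add more than one interaction column and therefore more degrees of freedom, which this sketch does not handle.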

The application of the weighted encoding to AMD involved a main effect filter that was applied prior to interaction analysis. The codominant encoding was used for this filter (p < 0.00001), as it had the highest and most consistent correlation with the other encoding types (Figures 5.2 and 5.3). Pairwise combinations of interactions were tested between all main effect filtered SNPs by performing the LRT as described above. For both main effect and interactions in AMD, we adjusted for age, sex, site, genotype platform, and the first 6 principal components.

After compiling all power results from simulated datasets of our parameter sweep, we performed ANOVA tests by model for each encoding to identify effects of MAF, sample size, case-control ratio, and baseline-maximum risk ratio on power.

Results

In order to determine the impact of encoding choice on main effect and interaction results, we assessed the additive, dominant, recessive, and codominant encodings using linear and logistic regression in body mass index (BMI) and age-related cataract, respectively. Figure 5.2 shows the correlations of main effect p-values in age-related cataract and BMI. Recessive and dominant demonstrated the lowest correlation (r²=0.06 for both phenotypes). Correlation values between recessive and additive main effect p-values were r² of 0.37 for cataract and 0.28 for BMI. Additive and dominant results demonstrated the highest correlation, with an r² of 0.69 for cataract and an r² of 0.78 for BMI. Notably, the codominant method demonstrated the most consistent correlation with the other encodings. Next, we calculated the correlations for the LRT p-values between the traditional encodings (Figure 5.3). The general patterns observed for main effect correlations were maintained for the interaction results: the lowest correlation was between dominant and recessive encodings, additive and recessive showed lower correlation than additive and dominant, and codominant had the highest consistency across the different encodings. However, for every encoding combination, the correlation for interaction results was reduced compared to the main effect results. Spearman correlations were also calculated for both phenotypes, and the results were similar (data not shown).


Figure 5.2. Correlations between main effect results obtained using additive, dominant, recessive, and codominant encodings. Main effect p-values were obtained using each encoding and Pearson’s correlations were calculated and plotted in a matrix for age-related cataract (bottom left) and body mass index (top right).

Figure 5.3. Correlations between interaction results obtained using additive, dominant, recessive, and codominant encodings. LRT p-values were obtained using each encoding and Pearson’s correlations were calculated and plotted in a matrix for age-related cataract (bottom left) and body mass index (top right).

The results shown in Figures 5.2 and 5.3 demonstrate that the choice of encoding impacts the results for main effect; further, that effect is magnified at the interaction level. The data-driven weighted encoding was developed to flexibly assign each SNP an individualized encoding based on the action it demonstrates in the data. To ensure that the weighted encoding was, indeed, assigning the expected heterozygous genotype value across different types of underlying genetic models, we simulated main effect SNPs with the following underlying genetic models: REC, SUB, ADD, SUP, and DOM (1,000 simulated datasets for each model type). Figure 5.4 shows the results when we plotted the distribution of the alpha values for each model type. For every model, the density plot peaks where we would expect (REC≈0, SUB≈0.25, ADD≈0.5, SUP≈0.75, and DOM≈1), which demonstrates that the method is effective at assigning an alpha value reflective of the underlying action of the heterozygous genotype at 30% MAF. We noted variation in the distribution of the alpha value density across the models, so we evaluated the effect of MAF on the alpha value and found that this had an impact on the value assigned (Appendix Figure 5.1). This is likely because, despite the simulated signal, there is a tendency for the alphas to trend toward the null, and the expected null distribution will also vary accordingly with MAF (Appendix Figure 5.2). Due to the evidenced effect of MAF, we chose to test subsequent analyses at varying allele frequencies.

Figure 5.4. Distribution of the estimated heterozygous action (α). Single SNP simulations were performed with five distinct underlying models: recessive (red), sub-additive (yellow), additive (green), super-additive (blue), and dominant (purple). 1,000 datasets for each SNP action were generated and the distribution of the alpha parameter was plotted.

Power calculations were performed to compare the weighted encoding with each of the traditional methods for interactions between simulated SNPs. Figure 5.5 shows the power for every encoding to detect each of the 29 interacting models we simulated, where the MAF for both SNPs was 0.3, for 1,000 cases and 1,000 controls with a baseline risk of 25% and maximum risk of 50%, with 1,000 datasets generated for each model. Additive, dominant, and recessive encodings each outperformed the other encodings for the models for which they were designed: ADDxADD, DOMxDOM, and RECxREC, respectively. As seen in Figure 5.5, power to detect different models varied widely. For instance, no encoding identified any model in which one or two SNPs had a REC underlying model. Conversely, models which included SNP(s) with DOM or SUP action demonstrated consistently high power for most of the encodings. The recessive encoding demonstrated consistently low power and did not exceed 80% power for any model.

Figure 5.5. Power plot for each interaction model. Power was calculated as the percentage of simulated datasets (out of 1,000 simulations for each model) in which each encoding detected interaction signal at LRT p-value < 0.05: additive (red circle), dominant (yellow square), recessive (green diamond), codominant (blue triangle), and weighted (purple inverted triangle). Twenty-nine interaction models were simulated. Above the solid line are “traditional” interaction models with comprehensive two-SNP combinations between SNPs with REC, SUB, ADD, SUP, and DOM action (between the solid and dotted line are models including SNPs with HET action). Below the solid line are “genotype-based” interaction models. The blue and red vertical lines mark 5% and 80% power, respectively.
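The power calculation described in the caption reduces to counting how many simulated datasets fall below the significance threshold. A minimal sketch follows; the function name is hypothetical, and the same calculation yields the false positive rate when applied to datasets simulated without an interaction effect.

```python
def detection_rate(lrt_p_values, threshold=0.05):
    """Percentage of simulated datasets with an LRT p-value below the threshold.

    Applied to datasets simulated with an interaction, this is power.
    Applied to main effect only or null datasets, it is the false
    positive rate.
    """
    if not lrt_p_values:
        raise ValueError("at least one p-value is required")
    detected = sum(1 for p in lrt_p_values if p < threshold)
    return 100.0 * detected / len(lrt_p_values)
```

For example, `detection_rate([0.01, 0.2, 0.04, 0.5])` gives 50.0, since two of the four p-values are below 0.05.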

To test whether our power results were specific to the chosen MAF, sample size, case-control ratio, and baseline to maximum risk ratio, we performed a parameter sweep whereby we simulated all 29 interacting models with comprehensive combinations of each of the parameters. ANOVA tests were employed to test for significant effects of each parameter on the encodings’ power to detect our models. As shown in Figure 5.6, MAF and sample size demonstrated a significant effect on power for a large number of models for each encoding (MAF had by far the greatest impact), while case-control ratio and baseline to maximum risk ratio had an effect on very few models. Investigations into differences observed across varying MAFs and sample sizes revealed similar results for both parameters: with increasing MAF and sample size, the power for each encoding to detect the interaction signal tended to increase (Appendix Figure 5.3).

Figure 5.6. ANOVA results from parameter sweep. ANOVA tests were performed to determine the impact of MAF, sample size, baseline to maximum risk ratio, and case-control ratio on power for each interaction model for the additive (red), dominant (yellow), recessive (green), codominant (blue), and weighted (purple) encodings.

As seen in Figure 5.5, the power for an encoding to detect a given interaction model varies depending on the type of underlying genetic model. For instance, models involving at least one SNP with REC action demonstrate reduced power when compared to models including DOM SNP(s). Assessing average power for each encoding across the interacting models was not a possibility, as the power was not consistent across every underlying model type. Thus, we standardized signal to noise ratio across the different models in order to determine the patterns in average model p-values for each encoding. Figure 5.7 shows the plotted average LRT p-value for each model by encoding for a sample signal to noise ratio to exemplify how signal detection can be standardized across the various models. As an example, we see consistency for the average LRT p-value plotted using the codominant encoding in this standardized signal to noise ratio (Figure 5.7).

Figure 5.7. Average LRT p-value for interaction models at a standardized signal to noise ratio. Average LRT p-value was calculated for each encoding: additive (red circle), dominant (yellow square), recessive (green diamond), codominant (blue triangle), and weighted (purple inverted triangle). Twenty-nine interaction models were simulated. Above the solid line are “traditional” interaction models with comprehensive two-SNP combinations between SNPs with REC, SUB, ADD, SUP, and DOM action (between the solid and dotted line are models including SNPs with HET action). Below the solid line are “genotype-based” interaction models. The blue and red vertical lines mark LRT p-values of 0.5 and 0.05, respectively.


The average power for each encoding was compared across all traditional interacting models with increasing signal to noise ratio at 10%, 30%, and 50% MAF. At 30% MAF (Figure 5.8): the weighted encoding demonstrated the highest average power across traditional models with increasing signal to noise ratio; the codominant encoding showed comparable power to weighted until the signal to noise reached a threshold at which increasing numbers of models diverged, causing reduced power; the additive and dominant encodings demonstrated diminished power compared to the weighted; and recessive showed low power to detect models even at high signal to noise ratio levels. Increasing the MAF to 50% resulted in greater average power for weighted, codominant, and recessive, and reduced average power for dominant and additive when compared to 30% MAF. At 10% MAF we observed high average power for weighted, additive, and dominant encodings, while codominant and recessive showed exceedingly low power due to non-convergence.


Figure 5.8. Average power for standardized signal to noise ratio across all traditional interaction models at 10% (A), 30% (B), and 50% (C) MAF. Average power was calculated for standardized signal to noise ratios and plotted at increasing signal to noise ratios for each encoding: additive (red), dominant (yellow), recessive (green), codominant (blue), and weighted (purple).

Similar patterns emerged for the genotype-based models. However, the codominant encoding only demonstrated reduced average power for the datasets simulated with 10% MAF, and the additive, dominant, and weighted encodings showed somewhat diminished power at 50% MAF as compared to that seen in the traditional models.


Figure 5.9. Average power for standardized signal to noise ratio across all genotype-based interaction models at 10% (A), 30% (B), and 50% (C) MAF. Average power was calculated for standardized signal to noise ratios and plotted at increasing signal to noise ratios for each encoding: additive (red), dominant (yellow), recessive (green), codominant (blue), and weighted (purple).

To assess whether there were any genetic scenarios under which the encodings demonstrated inflation where there was no interaction signal, we simulated models that were main effect-only as well as a completely null dataset at 30% MAF. We simulated one-SNP main effect-only models in which only one of the SNPs exhibits a main effect while the other does not demonstrate an effect on the phenotype. The main effect SNPs that were simulated included REC, SUB, ADD, SUP, DOM, and HET. The two-SNP main effect-only models included all pairwise combinations between REC, SUB, ADD, SUP, DOM, and HET, both with main effect and no interaction effect. We additionally simulated completely null samples with no main or interaction effects for 1,000 datasets. As shown in Figure 5.10, the majority of the encodings demonstrate a false positive rate (FPR) near 5% for each model. Deviations from this were seen for the additive encoding, where inflation was observed, including the scenarios under which: 1) both SNPs exhibit a main effect under the dominant underlying model (FPR≅16%); 2) one SNP exhibits a heterozygous main effect (FPR≅11%); and 3) SNP1 exhibits a dominant main effect and SNP2 exhibits a heterozygous main effect (FPR≅9%). FPR represents the percentage of the time an encoding identified an interaction term to be significant above and beyond the effect of each SNP when there is no simulated interaction.

Figure 5.10. Type 1 error in simulated main effect only and null data. Two-SNP (above the first solid line) and One-SNP (below the first solid line) Main Effect Only models and null data (below the second solid line) were simulated (1,000 datasets each), and LRT p-values were calculated using additive (red circle), dominant (yellow square), recessive (green diamond), codominant (blue triangle), and weighted (purple inverted triangle) encodings to identify the percentage of time each encoding identified a significant interaction when there was none (false positive rate). The red vertical line marks a 5% false positive rate.

When we considered the average false positive rate among type I error datasets (including the main effect only and null datasets) at standardized and increasing signal to noise ratios (Figure 5.11), we observed low type 1 error for codominant and weighted (following the 5% level closely) and recessive (below the 5% level at high signal to noise ratios). However, at higher signal to noise ratios, we observed inflation of type 1 error for the additive and dominant encodings. Further inspection of the models in which each encoding demonstrated the highest type 1 error revealed exceedingly high type 1 error for additive and dominant encodings.

Figure 5.11. Average and maximum type 1 error across all main effect only and null simulated datasets. Average false positive rates (solid lines) were calculated for standardized signal to noise ratios as well as the false positive rate for the maximum inflated model (dashed lines) and plotted at increasing signal to noise ratios for each encoding: additive (red), dominant (yellow), recessive (green), codominant (blue), and weighted (purple). The black lines denote the expected values.

Finally, we applied our method to natural data from the Electronic Medical Records and Genomics (eMERGE) Network using age-related macular degeneration (AMD) cases (2,167) and controls (10,986). We identified 438 SNP-SNP models that were significant when allowing for a 5% false discovery rate. One top model included genes that encode CFH and C2.

Discussion

Current methods for coding genotypes across genome-wide loci include additive, dominant, and recessive. A major issue with each of these methods is that they require assumptions about the way in which the risk allele acts. For instance, the additive encoding assumes the risk incurred by one copy of the alternate allele is precisely half the risk of having two. Circumstances in which one copy of the alternate allele incurs 80% (trending toward dominance) or 20% (trending toward recessive) of the risk of having two are not considered using these traditional approaches. The codominant model makes no assumptions about the biological action of a marker. Yet, non-convergence is an issue when using this method for interactions, especially for lower-frequency SNPs. Further, loci across the genome are unlikely to demonstrate the same genetic action; thus, choosing just one of these simplistic methods is too restrictive. In this chapter, we presented a novel, robust alternative, the data-driven weighted encoding, in which a flexible heterozygous value is assigned based on the action demonstrated in the data.

Interaction models are traditionally encoded assuming an additive model, as this encoding was expected to capture effects of SNPs with many actions. If this assumption were true, we would expect the interaction results obtained using the additive encoding to be highly correlated with the results from others. Figures 5.2 and 5.3 reveal a different story, however, as they indicate that encoding type does demonstrate an effect for main effect results and an even greater effect on interaction results. These results demonstrate the need for a single method that accurately assigns a unique encoding for each SNP.

We assessed our method’s ability to accurately assign heterozygous action to a given locus by simulating main effect single SNPs with five possible models (recessive, sub-additive, additive, super-additive, and dominant) and assessing the heterozygous values the method assigned. Once these results (displayed in Figure 5.4) validated the method’s capability of flexibly assigning heterozygous action, we applied the weighted encoding to interactions and compared its performance to the traditional methods.

To do this, we designed comprehensive combinations of simulated genetic interaction and main effect only models and compared the performance of our novel weighted encoding to that of the traditional methods. As expected, additive, dominant, and recessive encodings demonstrated high power to detect the types of models for which they were designed (for example, the recessive encoding outperformed all other encodings for the model that included two interacting SNPs with recessive actions). With the exception of those expected scenarios, these methods varied widely in their power to detect other model types. Conversely, the weighted encoding demonstrated reasonable power across numerous types of models and was among the highest for average power with every minor allele frequency tested. We additionally compared the method to the codominant encoding and found that, for situations in which the codominant encoding loses power (30% and 10% MAF and high signal to noise ratios), our method retained sufficient power. These findings suggest that the weighted encoding provides a robust alternative to the traditional methods at detecting interaction signal for SNPs of varying allele frequencies when the underlying genetic action is unknown.

We further tested our method with null and main effect-only simulated datasets and found that our method maintained a very low false positive rate for likelihood ratio test (LRT) p-values, while the additive and dominant encodings demonstrated inflation. The typical paradigm for testing genetic interactions employs the additive encoding. According to our results, a large portion of results published using this method could be false positives. Specifically, our results indicate that the additive encoding falsely identifies SNPs that exhibit main effect dominant action as having an interaction that predicts the outcome above and beyond the main effects when there actually is no simulated interaction effect. We are not arguing that the additive encoding is not identifying meaningful biology in these cases, as it is picking out models with signal; however, it is inaccurately identifying interaction when there is none.

Finally, we applied our method to natural data from the Electronic Medical Records and Genomics (eMERGE) Network using age-related macular degeneration (AMD) cases and controls. We identified 438 SNP-SNP models that were significant when allowing for a 5% false discovery rate. One top model included the genes that encode complement factor H (CFH) and complement component 2 (C2), both with previously reported involvement in AMD4,219. Notably, though, this is the first reported example of an interaction between the two genes, to the best of the authors’ knowledge. Thus, this method has revealed potential insights into the complexity of this disease.

An important limitation of this method is the demonstrated influence of minor allele frequency on the assignment of the estimated heterozygous value (Appendix Figure 5.1). At low allele frequencies, the value trends toward null (Appendix Figure 5.2). Despite this effect, the weighted encoding maintained among the highest average power across models, even for variants with 10% MAF. Still, caution is encouraged when using the method for variants of lower allele frequencies, as it has not been tested under those conditions. Genetic association tests, however, are not well suited to these types of variants in any case, and it is therefore not expected that this method will show significantly reduced power compared to other encodings under this scenario.

Our results demonstrate that caution is needed when using the additive and dominant encodings for interactions, due to the inflation observed, and that the recessive encoding lacks sufficient power to detect interaction signal. Additionally, the codominant encoding exhibited reduced power for some scenarios when the MAF was less than or equal to 30%. We demonstrated the utility of our novel encoding method at flexibly assigning action to individual loci, robustly identifying interactions between SNPs with diverse action, and uncovering examples of genetic interaction in complex disease.

CHAPTER 6

CONCLUSIONS

This dissertation described the rationale for embracing complexity beyond genome-wide association studies (GWAS) as well as three specific areas to explore: connections across the phenome, the exposome and gene-environment (GxE) interactions, and gene-gene (GxG) interactions. Certainly, GWAS has its place as a tool for describing the proportion of trait heritability that can be explained by the effect of individual loci, and it has seen considerable success for some traits. Yet, the author predicts that the observed skew in associations toward small effect sizes will continue, even as sample sizes increase and sequencing methods become more common. If the environment and epistasis are not considered, the as-yet-unexplained heritability (which is very large for most common traits) will likely remain elusive.

Investigators exploring these areas face many challenges, however, as explained in Chapter 1. One of the largest hurdles is finding the true signal through the noise when considering all pairwise combinations between genetic loci or between genetic loci and exposures. This challenge was addressed in Chapter 3 of this dissertation with the use of an environment-wide association study (EWAS) as a filter of exposures for subsequent GxE interaction testing to predict type 2 diabetes (T2D). Using this approach, the exposure Alcohol 30 Day Frequency was found to interact with a missense single nucleotide polymorphism (SNP) in GANC, a known drug target for T2D. This finding, if validated, would further elucidate the complex interaction between lifestyle and genetics underlying this condition. Future investigation in model organisms will allow for additional verification as well as insights into the mechanisms of this finding.

A biology-based filtering method, Biofilter, was employed to reduce the number of interaction tests when exploring epistasis in age-related cataract, as described in Chapter 4. This method enables investigation of SNP pairs that are biologically linked but that may not demonstrate a main effect. Using this approach, several genetic interaction models replicated across two separate datasets. One of these findings was for two interacting SNPs, one in an intron of CDH1 (encoding E-cadherin) and the other in an intron of CTNNB1 (encoding β-catenin). These factors are known to be involved in processes that include cell-to-cell adhesion signaling, cell-cell junction organization, and cell-cell communication. Each of these mechanisms has been implicated in lens development, lens maintenance, and cataract development; however, this is the first evidence of these specific factors being involved in age-related cataract. Perhaps this is because most genetic studies of this phenotype have not explored interactions.

Another challenge to interaction analysis is that the current method for coding genetic action is too restrictive. Evidence described in Chapter 5 indicates that the results obtained when using four different encoding methods (additive, dominant, recessive, and codominant) will vary greatly, especially at the GxG interaction level. Further evidence indicates that the recessive method is underpowered to detect most interaction models and that the additive and dominant encodings have inflated false positive rates for models with no simulated interaction. The codominant method suffers from loss of power due to complete separation at low allele frequencies. Results demonstrate that the weighted encoding is flexible across numerous underlying genetic models, while maintaining a low false positive rate. This novel encoding method exhibits promise for accurately assigning SNP-specific encoding for robust GxG interaction detection. Further work will involve testing this method with GxE interactions.
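To make the comparison concrete, the four traditional encodings can be written out as mappings from a genotype (counted as the number of minor alleles) to the value or values entered into a regression model. The sketch below is illustrative only: the function names are hypothetical, the codominant encoding is shown in its common two-indicator form, and the product of the two encoded SNPs is used as the interaction term as an assumption about the model.

```python
from typing import Dict, List, Tuple

# Genotypes are coded as minor-allele counts: 0, 1, or 2.
# Each single-variable encoding maps a genotype to one numeric value.
ENCODINGS: Dict[str, Dict[int, int]] = {
    "additive": {0: 0, 1: 1, 2: 2},   # each minor allele adds one unit
    "dominant": {0: 0, 1: 1, 2: 1},   # one minor allele is sufficient
    "recessive": {0: 0, 1: 0, 2: 1},  # two minor alleles are required
}


def encode(genotypes: List[int], model: str) -> List[int]:
    """Encode minor-allele counts under the named single-variable model."""
    table = ENCODINGS[model]
    return [table[g] for g in genotypes]


def encode_codominant(genotypes: List[int]) -> List[Tuple[int, int]]:
    """Two indicators per genotype: (is_heterozygous, is_homozygous_minor)."""
    return [(int(g == 1), int(g == 2)) for g in genotypes]


def interaction_term(snp1: List[int], snp2: List[int]) -> List[int]:
    """Element-wise product of two encoded SNPs (assumed interaction term)."""
    return [a * b for a, b in zip(snp1, snp2)]


# Hypothetical genotypes for four individuals at two SNPs.
genotypes_snp1 = [0, 1, 2, 2]
genotypes_snp2 = [1, 1, 1, 2]

for model in ENCODINGS:
    x1 = encode(genotypes_snp1, model)
    x2 = encode(genotypes_snp2, model)
    print(model, interaction_term(x1, x2))
print("codominant", encode_codominant(genotypes_snp1))
```

The same four individuals yield different interaction terms under each encoding, which is one way to see why results can differ so sharply at the GxG level.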

An additional need for exploring complexity beyond GWAS is consideration of multiple phenotypes. This will allow connections between genetic variation and multiple phenotypes (pleiotropy) to be revealed. Phenome-wide association studies (PheWAS) enable interrogation of multiple SNPs and phenotypes simultaneously. Chapter 2 describes a PheWAS performed using 80 SNPs and over 1,000 phenotypes in the National Health and Nutrition Examination Surveys (NHANES). Thirteen SNPs demonstrated pleiotropy, and further complex biological networks were discovered. For example, two PheWAS-significant SNPs mapped to genes involved in the KEGG biological process “glycerolipid metabolism”. One SNP (mapped to LPL) was associated with HDL-C levels and the other (mapped to LIPC) with folate levels. Plasma folate levels have been previously associated with lipoprotein profiles. Further, LPL is also involved in the KEGG pathway “Peroxisome Proliferator-Activated Receptor (PPAR) signaling pathway”, along with APOA5, in which a SNP was associated with triglyceride levels. PPARs are transcription factors activated by lipids. When we employ network approaches like those described in Chapter 2, we can highlight both the biological context that supports results found in PheWAS and the biological annotation that may identify relationships from which to develop new hypotheses about the connection between genetic variation and complex outcomes.
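The basic bookkeeping behind flagging candidate pleiotropy in PheWAS output can be sketched in a few lines. This is a minimal illustration, not the pipeline used in Chapter 2: the SNP and phenotype names, the p-values, and the significance threshold are all invented, and pleiotropy is assumed here to mean significant association with more than one distinct phenotype.

```python
from collections import defaultdict
from typing import Dict, Iterable, Set, Tuple

Association = Tuple[str, str, float]  # (snp, phenotype, p_value)


def pleiotropic_snps(results: Iterable[Association], alpha: float) -> Dict[str, Set[str]]:
    """Return SNPs significantly associated with more than one phenotype."""
    hits: Dict[str, Set[str]] = defaultdict(set)
    for snp, phenotype, p_value in results:
        if p_value < alpha:
            hits[snp].add(phenotype)
    # Keep only SNPs whose significant hits span at least two phenotypes.
    return {snp: phenos for snp, phenos in hits.items() if len(phenos) > 1}


# Invented toy results; none of these values come from the dissertation.
toy_results = [
    ("snp_A", "phenotype_1", 1e-6),
    ("snp_A", "phenotype_2", 3e-5),
    ("snp_B", "phenotype_1", 2e-7),
    ("snp_B", "phenotype_3", 0.20),
    ("snp_C", "phenotype_2", 0.04),
]

print(pleiotropic_snps(toy_results, alpha=1e-4))
```

In practice, closely related phenotypes would likely need to be grouped before counting, so that one underlying trait measured several ways is not mistaken for pleiotropy.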

It is not expected that any single one of the methods described in this dissertation will, in and of itself, explain the missing heritability of complex traits or single-handedly enable disease prediction for every person. Rather, it is through combining these methods, along with other sophisticated tools for complexity, that we can hope to explain the etiology of common traits. As a first step, the mechanisms that underlie pleiotropy may be elucidated by incorporating tools that explore epistasis and gene-environment interactions, like those described in this dissertation. Tyler et al. (2009) argue that pleiotropy and epistasis are ubiquitous and related processes. Identifying mutations leading to multiple, seemingly disparate outcomes demonstrates the deeply connected and complex nature of biology220. Pleiotropy may reveal epistatic mechanisms influencing distinct traits220.

The dissertation’s author asserts that pleiotropy could be due to variations in environmental exposures as well. As a hypothetical example, let us say that a single locus confers risk for two disparate phenotypes, hypothyroidism and age-related cataract, due to modulation by two different exposures: iodine intake and UV exposure, respectively (Figure 6.1A). Figures 6.1B, C, and D display analytic methods to investigate this hypothesis. A similar approach can be used to test the aforementioned theory by Tyler et al., where, instead of exposures, other SNPs can be tested for interactions with the pleiotropic loci. Of course, it is likely that the true underlying mechanism behind divergent phenotypes arising from a single mutation is neither epistasis nor GxE interactions alone. Rather, complex and connected networks of multiple influencing genetic and environmental factors are likely to interact in different manners that lead to distinct outcomes. Integrating the methods described in the chapters of this dissertation will allow investigators to interrogate the genetic and environmental interaction networks underlying multiple diseases simultaneously.

Figure 6.1. Integration of PheWAS and EWAS to uncover gene-environment interplay underlying pleiotropy. A. Theoretical example in which two environmental exposures (iodine intake and UV exposure) modulate the effect of a single genetic locus on two disparate outcomes (hypothyroidism and age-related cataract). To investigate evidence of the environment’s impact on pleiotropy, the following steps can be taken: PheWAS can be employed (B) to identify SNPs demonstrating evidence of pleiotropy (C), and a comprehensive test for interaction with all exposures (or a filtered set) can be used (D) to reveal any exposures that modulate the outcomes of pleiotropic SNPs.

There are other opportunities to improve upon the methods described in this dissertation. For instance, these studies exclusively considered common SNPs as the genetic components, as these data were originally collected for use in GWAS. As explained in Chapter 1, GWAS are designed to detect signal from common variants because this was where the heritability of common traits was thought to lie, according to the Common Disease/Common Variant hypothesis. It is the author’s belief that the reality is not so simple. Instead, it is likely that the true genetic etiology of common diseases involves multiple rare and common variants, acting through main effects and interactions.

Further, these methods do not allow for heterogeneity: disruptions in separate systems leading to the same outcome. Because biology is a complex system, there are countless combinations of alterations possible in different groups of individuals that could all lead to the same phenotype. Let us consider age-related cataract as an example. Three heterogeneous mechanisms are associated with this complex trait (Figure 6.2). First, eight genes involved in the EGFR pathway were implicated in our Biofilter analysis of age-related cataract (described in Chapter 4), including those encoding EGFR, EGF, E-cadherin, and β-catenin. Figure 6.2A shows a cartoon image of the EGFR pathway, which involves many interconnected components. We identified interactions between variants in the genes encoding EGFR and EGF, and between variants in the genes encoding E-cadherin and β-catenin, and it is possible that other factors in this pathway are involved in this disease as well. Theoretically, disruptions due to common or rare base pair mutations or structural variation at any essential region of a gene (or regulatory region) in any part of the pathway could lead to age-related cataract. A second mechanism associated with age-related cataract was identified by Pendergrass et al. (2013): an interaction between smoking at least 100 cigarettes in a lifetime and rs7529518, a SNP in an intron of calmodulin regulated spectrin-associated protein family, member 2 (CAMSAP2)101 (Figure 6.2B). Finally, UV exposure has been associated with age-related cataract221 (21617534). These examples are not comprehensive and do not include all known genetic and environmental risk factors for this condition; they are included to demonstrate the multitude of distinct ways in which the same outcome can occur. Further research is required to determine whether age-related cataract arises independently in different individuals as a result of these mechanisms in isolation or through the aggregation of these risk factors.


Figure 6.2. Three heterogeneous mechanisms associated with age-related cataract. A) Image of the EGFR pathway (source: http://www.sinobiological.com/EGFR-Signaling-Pathway-a-1567.html). There are multiple factors involved in this pathway and opportunities for disruption. B) An interaction between cigarette smoking and a SNP in CAMSAP2 has been implicated in age-related cataract. C) UV exposure is a well-established risk factor for age-related cataract. Whether these heterogeneous mechanisms act in aggregate or individually is unknown.

The author believes that the etiology of common traits is elusive largely because of heterogeneity. It is certainly possible that many risk variables work in aggregate to produce outcomes. However, it is the author’s working hypothesis that each individual has a unique combination of genetic variation and environmental/lifestyle variation that acts together, and through interactions, as risk or protection for each common disease. If this is true, and the reality is this complex, there is a need for even more sophisticated techniques that embrace this individual heterogeneity.

Gene-based methods have the potential to overcome the challenge of heterogeneity. These tools consider the genetic effect of variants across an entire gene, region, or pathway. BioBin42 is one such example. Put simply, BioBin sums the number of minor alleles (for simplicity, we will assume that the minor allele is always the risk allele) present in a gene, region, or pathway (called a “bin”) for each individual. If a bin is found to have a significantly greater number of risk alleles in cases than in controls, the bin is considered to be associated with the outcome. This method does not depend on which loci carry risk alleles in the cases, only on their presence somewhere in the bin. As an example, BioBin can be used to bin variants across all of a given gene (Figure 6.3). If each individual case has a risk allele at a different locus, the gene will still be detected if controls have significantly fewer risk alleles across the gene. In the example shown in Figure 6.3, though each case has a risk allele at a different locus, the number of risk alleles across the exons in cases is greater than that in controls. Using GWAS, the signal would be dispersed across the gene, and any single locus would have a very low probability of being detected. This approach can be used at the pathway level as well, which allows detection of multiple distinct disruptions across different genes within the same pathway that could each contribute to disease status; these would be missed if each locus were examined on its own.

Figure 6.3. Use of BioBin to combine variants across genes. Four exons of a hypothetical gene are shown for 3 cases (blue) and 3 controls (red). Each green line represents a locus of variation. For simplicity, every variant is heterozygous (presence of only one risk allele; note: any genotype that is homozygous for the risk allele would count as 2 risk alleles). When the variants in each sample’s bin are added, the cases together have 6 risk alleles in this gene while the controls, together, have 2. If this is a statistically significant difference, the gene is considered as associated with the outcome.
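The counting logic described in the caption can be sketched as follows. This is not BioBin’s implementation: the function names and locus labels are hypothetical, and the placement of variants is invented so that the totals match the caption (6 risk alleles in cases, 2 in controls, all heterozygous). The statistical test that would follow is deliberately left out, since the text does not specify one here.

```python
from typing import Dict, List

Genotypes = Dict[str, int]  # locus -> risk-allele count (0, 1, or 2)


def bin_burden(sample: Genotypes, bin_loci: List[str]) -> int:
    """Sum one individual's risk-allele counts over every locus in the bin."""
    return sum(sample.get(locus, 0) for locus in bin_loci)


def group_burden(samples: List[Genotypes], bin_loci: List[str]) -> int:
    """Total risk alleles in the bin across a group of individuals."""
    return sum(bin_burden(sample, bin_loci) for sample in samples)


def max_carriers_at_any_locus(samples: List[Genotypes], bin_loci: List[str]) -> int:
    """Largest number of individuals carrying a risk allele at one locus."""
    return max(
        sum(1 for sample in samples if sample.get(locus, 0) > 0)
        for locus in bin_loci
    )


# Hypothetical gene bin with eight variable loci (labels are invented).
GENE_BIN = [f"locus_{i}" for i in range(1, 9)]

# Each case is heterozygous at two loci that no other case shares.
cases = [
    {"locus_1": 1, "locus_2": 1},
    {"locus_3": 1, "locus_4": 1},
    {"locus_5": 1, "locus_6": 1},
]
controls = [{"locus_7": 1}, {"locus_8": 1}, {}]

case_total = group_burden(cases, GENE_BIN)
control_total = group_burden(controls, GENE_BIN)
print("risk alleles in cases:", case_total)
print("risk alleles in controls:", control_total)
print("most cases sharing any one locus:", max_carriers_at_any_locus(cases, GENE_BIN))
```

Because no locus is shared by more than one case in this toy layout, a locus-by-locus test sees almost nothing, while the bin-level totals differ clearly between groups.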

Theoretically, BioBin can be applied to GxG interactions, which would allow detection of multiple variants within interacting genes that lead to the same outcome in different individuals. This method has the potential for robust GxE discovery as well, and could further be used to bin copy number variant data across genes and pathways. BioBin is designed for, and has been established as, a tool to handle rare variants. In theory, it can be applied to common SNPs as well; however, linkage disequilibrium and opposing directions of effect could wash out the signal.

Incorporating the weighted encoding with BioBin may improve the predictive power of the method while still allowing for heterogeneity. Currently in BioBin, presence of the heterozygous genotype at a locus will add a score of 1 and homozygous risk allele genotype adds a score of 2, which assumes the additive model (note that the dominant and recessive models are also available features in BioBin). However, as we have seen in Chapter 5, genetic action can vary across different SNPs. We can use the weighted encoding to accurately assign action to each SNP and bin the variants across a gene or pathway with a weighted value.
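One way the proposed combination might look is sketched below. Everything here is an assumption made for illustration: the per-SNP heterozygous weight is taken to lie between 0 and 1, and a heterozygote is scored as twice that weight so that a weight of 0.5 reproduces the additive scores of 1 and 2 described above. The function name, locus labels, and weights are hypothetical, and neither BioBin nor Chapter 5 is claimed to use this exact scaling.

```python
from typing import Dict, List


def weighted_bin_score(
    sample: Dict[str, int],
    het_weights: Dict[str, float],
    bin_loci: List[str],
    default_weight: float = 0.5,
) -> float:
    """Sum a bin score using a SNP-specific value for heterozygotes.

    Assumed scaling: homozygous risk genotypes score 2, heterozygotes score
    2 * weight. A weight of 0.5 gives the additive model, 1.0 treats a
    heterozygote like a homozygote, and 0.0 gives heterozygotes no score.
    """
    score = 0.0
    for locus in bin_loci:
        count = sample.get(locus, 0)
        if count == 2:
            score += 2.0
        elif count == 1:
            score += 2.0 * het_weights.get(locus, default_weight)
    return score


# Hypothetical bin, genotypes, and weights (not estimated from any data).
bin_loci = ["snp_1", "snp_2", "snp_3"]
sample = {"snp_1": 1, "snp_2": 2, "snp_3": 1}
weights = {"snp_1": 1.0, "snp_3": 0.0}

print("additive score:", weighted_bin_score(sample, {}, bin_loci))
print("weighted score:", weighted_bin_score(sample, weights, bin_loci))
```

Under this scaling the two heterozygous loci contribute very differently to the bin, which is the flexibility the weighted encoding is meant to provide.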

Genome-wide association studies have provided a foundation from which to explore the genetic components of common traits. To further explain these phenotypes, methods embracing complexity beyond GWAS are necessary. The methods described in this dissertation represent a step toward modeling complexity by embracing phenotypic connections, the exposome and gene-environment interactions, and genetic interactions. However, precise, individualized medical prediction, risk prevention, and treatment are the goal. Further, elucidating complex gene-environment interactions will allow greater efforts toward protecting vulnerable populations. To achieve these goals, it is essential to integrate methods like those discussed here so as to detect genotype-environment-phenotype networks while allowing for heterogeneity, which is likely to further elucidate the complex underpinnings of common human disease.

REFERENCES

1. Kerem, B. et al. Identification of the cystic fibrosis gene: genetic analysis. Science 245, 1073–80 (1989).

2. Riordan, J. R. et al. Identification of the cystic fibrosis gene: cloning and characterization of complementary DNA. Science 245, 1066–1073 (1989).

3. Rommens, J. M. et al. Identification of the cystic fibrosis gene: chromosome walking and jumping. Science 245, 1059–1065 (1989).

4. Klein, R. J. et al. Complement factor H polymorphism in age-related macular degeneration. Science 308, 385–389 (2005).

5. Lander, E. S. et al. Initial sequencing and analysis of the human genome. Nature 409, 860–921 (2001).

6. A haplotype map of the human genome. Nature 437, 1299–1320 (2005).

7. Reich, D. E. & Lander, E. S. On the allelic spectrum of human disease. Trends in Genetics 17, 502–510 (2001).

8. Smith, D. J. & Lusis, A. J. The allelic structure of common disease. Hum. Mol. Genet. 11, 2455–2461 (2002).

9. Collins, F. S., Guyer, M. S. & Chakravarti, A. Variations on a theme: cataloging human DNA sequence variation. Science 278, 1580–1581 (1997).

10. Welter, D. et al. The NHGRI GWAS Catalog, a curated resource of SNP-trait associations. Nucleic Acids Res. 42, (2014).

11. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. U. S. A. 106, 9362–9367 (2009).

12. Hindorff, L. A. et al. Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc. Natl. Acad. Sci. 106, 9362–9367 (2009).

13. Manolio, T. A. et al. Finding the missing heritability of complex diseases. Nature 461, 747–753 (2009).

14. Manolio, T. A. Bringing genome-wide association findings into clinical use. Nat. Rev. Genet. 14, 549–58 (2013).

15. Florez, J. C. Newly identified loci highlight beta cell dysfunction as a key cause of type 2 diabetes: Where are the insulin resistance genes? Diabetologia 51, 1100–1110 (2008).

16. Visscher, P. M., Brown, M. A., McCarthy, M. I. & Yang, J. Five years of GWAS discovery. American Journal of Human Genetics 90, 7–24 (2012).

17. Postmus, I. et al. Pharmacogenetic meta-analysis of genome-wide association studies of LDL cholesterol response to statins. Nat. Commun. 5, 5068 (2014).

18. Hirschhorn, J. N. Genomewide association studies--illuminating biologic pathways. N. Engl. J. Med. 360, 1699–1701 (2009).

19. Wang, W. Y. S., Barratt, B. J., Clayton, D. G. & Todd, J. A. Genome-wide association studies: theoretical and practical concerns. Nat. Rev. Genet. 6, 109–18 (2005).

20. Maher, B. Personal genomes: The case of the missing heritability. Nature 456, 18–21 (2008).

21. Moore, J. H., Asselbergs, F. W. & Williams, S. M. Bioinformatics challenges for genome-wide association studies. Bioinformatics 26, 445–55 (2010).

22. Goldstein, D. B. Common genetic variation and human traits. N. Engl. J. Med. 360, 1696–8 (2009).

23. Donnelly, P. Progress and challenges in genome-wide association studies in humans. Nature 456, 728–731 (2008).

24. Ritchie, M. D. Finding the epistasis needles in the genome-wide haystack. Methods Mol. Biol. 1253, 19–33 (2015).

25. Solovieff, N., Cotsapas, C., Lee, P. H., Purcell, S. M. & Smoller, J. W. Pleiotropy in complex traits: challenges and strategies. Nat. Rev. Genet. 14, 483–495 (2013).

26. Sivakumaran, S. et al. Abundant pleiotropy in human complex diseases and traits. Am. J. Hum. Genet. 89, 607–618 (2011).

27. White, J. K. et al. Genome-wide generation and systematic phenotyping of knockout mice reveals new roles for many genes. Cell 154, 452–64 (2013).

28. McGuigan, K. et al. The nature and extent of mutational pleiotropy in gene expression of male Drosophila serrata. Genetics 196, 911–21 (2014).

29. Tassabehji, M., Newton, V. E. & Read, A. P. Waardenburg syndrome type 2 caused by mutations in the human microphthalmia (MITF) gene. Nat. Genet. 8, 251–255 (1994).

30. Liu, M. et al. Identification of a CRYAB mutation associated with autosomal dominant posterior polar cataract in a Chinese family. Investig. Ophthalmol. Vis. Sci. 47, 3461–3466 (2006).

31. Inagaki, N. et al. αB-crystallin mutation in dilated cardiomyopathy. Biochem. Biophys. Res. Commun. 342, 379–386 (2006).

32. Hall, M. A. et al. Detection of pleiotropy through a phenome-wide association study (PheWAS) of epidemiologic data as part of the Environmental Architecture for Genes Linked to Environment (EAGLE) Study. PLoS Genet. 10, (2014).

33. Denny, J. C. et al. PheWAS: demonstrating the feasibility of a phenome-wide scan to discover gene–disease associations. Bioinformatics 26, 1205–1210 (2010).

34. Denny, J. C. et al. Variants near FOXE1 are associated with hypothyroidism and other thyroid conditions: Using electronic medical records for genome- and phenome-wide studies. Am. J. Hum. Genet. 89, 529–542 (2011).

35. Denny, J. C. et al. Systematic comparison of phenome-wide association study of electronic medical record data and genome-wide association study data. Nat. Biotechnol. 31, 1102–1111 (2013).

36. Hebbring, S. J. et al. A PheWAS approach in studying HLA-DRB1*1501. Genes Immun. 14, 187–91 (2013).

37. Namjou, B. et al. Phenome-wide association study (PheWAS) in EMR-linked pediatric cohorts, genetically links PLCL1 to speech language development and IL5-IL13 to Eosinophilic Esophagitis. Front. Genet. 5, 401 (2014).

38. Cronin, R. M. et al. Phenome-wide association studies demonstrating pleiotropy of genetic variants within FTO with and without adjustment for body mass index. Front. Genet. 5, 1–14 (2014).

39. Pendergrass, S. A. et al. The Use of Phenome-Wide Association Studies (PheWAS) for Exploration of Novel Genotype-Phenotype Relationships and Pleiotropy Discovery. Genet. Epidemiol. 35, 410–422 (2011).

40. Moore, C. B. et al. Phenome-wide Association Study Relating Pretreatment Laboratory Parameters With Human Genetic Variants in AIDS Clinical Trials Group Protocols. Open forum Infect. Dis. 2, ofu113 (2015).

41. Moore, C. B. et al. Low Frequency Variants, Collapsed Based on Biological Knowledge, Uncover Complexity of Population Stratification in 1000 Genomes Project Data. PLoS Genet. 9, (2013).

42. Moore, C. B., Wallace, J. R., Frase, A. T., Pendergrass, S. A. & Ritchie, M. D. Using BioBin to explore rare variant population stratification. Pac. Symp. Biocomput. 332–43 (2013).

43. Wu, M. C. et al. Powerful SNP-Set Analysis for Case-Control Genome-wide Association Studies. Am. J. Hum. Genet. 86, 929–942 (2010).

44. Mitchell, S. L. et al. Investigating the relationship between mitochondrial genetic variation and cardiovascular-related traits to develop a framework for mitochondrial phenome-wide association studies. BioData Min. 7, 6 (2014).

45. Hamilton, C. M. et al. The PhenX Toolkit: get the most from your measures. Am. J. Epidemiol. 174, 253–60 (2011).

46. Patel, C. J. & Ioannidis, J. P. A. Studying the elusive environment in large scale. JAMA 311, 2173–4 (2014).

47. Rappaport, S. M. & Smith, M. T. Epidemiology. Environment and disease risks. Science 330, 460–1 (2010).

48. Wild, C. P. Complementing the genome with an ‘exposome’: the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol. Biomarkers Prev. 14, 1847–50 (2005).

49. Kelly, S. P., Thornton, J., Edwards, R., Sahu, A. & Harrison, R. Smoking and cataract: review of causal association. J. Cataract Refract. Surg. 31, 2395–404 (2005).

50. Taylor, H. R. et al. Effect of ultraviolet radiation on cataract formation. N. Engl. J. Med. 319, 1429–33 (1988).

51. Volk, H. E., Hertz-Picciotto, I., Delwiche, L., Lurmann, F. & McConnell, R. Residential proximity to freeways and autism in the CHARGE study. Environ. Health Perspect. 119, 873–7 (2011).

52. Volk, H. E., Lurmann, F., Penfold, B., Hertz-Picciotto, I. & McConnell, R. Traffic-related air pollution, particulate matter, and autism. JAMA psychiatry 70, 71–7 (2013).

53. Patel, C. J., Bhattacharya, J. & Butte, A. J. An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS One 5, e10746 (2010).

54. Lind, P. M., Risérus, U., Salihovic, S., Bavel, B. van & Lind, L. An environmental wide association study (EWAS) approach to the metabolic syndrome. Environ. Int. 55, 1–8 (2013).

55. Patel, C. J., Chen, R., Kodama, K., Ioannidis, J. P. A. & Butte, A. J. Systematic identification of interaction effects between genome- and environment-wide associations in type 2 diabetes mellitus. Hum. Genet. 132, 495–508 (2013).

56. Patel, C. J. et al. Systematic evaluation of environmental and behavioural factors associated with all-cause mortality in the United States national health and nutrition examination survey. Int. J. Epidemiol. 42, 1795–810 (2013).

57. Patel, C. J. et al. Investigation of maternal environmental exposures in association with self-reported preterm birth. Reprod. Toxicol. 45, 1–7 (2014).

58. Hall, M. A. et al. Environment-wide association study (EWAS) for type 2 diabetes in the Marshfield Personalized Medicine Research Project Biobank. Pac. Symp. Biocomput. 200–11 (2014).

59. Tzoulaki, I. et al. A nutrient-wide association study on blood pressure. Circulation 126, 2456–64 (2012).

60. Davis, M. A. et al. A dietary-wide association study (DWAS) of environmental metal exposure in US children and adults. PLoS One 9, e104768 (2014).

61. Caspi, A. et al. Influence of life stress on depression: moderation by a polymorphism in the 5-HTT gene. Science 301, 386–389 (2003).

62. Vineis, P. et al. Current smoking, occupation, N-acetyltransferase-2 and bladder cancer: A pooled analysis of genotype-based studies. Cancer Epidemiol. Biomarkers Prev. 10, 1249–1252 (2001).

63. Yu, M. C., Skipper, P. L., Tannenbaum, S. R., Chan, K. K. & Ross, R. K. Arylamine exposures and bladder cancer risk. Mutat. Res. - Fundam. Mol. Mech. Mutagen. 506-507, 21–28 (2002).

64. Martinez, E. Multi-protein complexes in eukaryotic gene transcription. Plant Mol. Biol. 50, 925–947 (2002).

65. Gallie, D. R. Protein-protein interactions required during translation. Plant Mol. Biol. 50, 949–970 (2002).

66. Bateson. Mendel’s Principles of Heredity. (Cambridge University Press, 1909).

67. Moore, J. H. The ubiquitous nature of epistasis in determining susceptibility to common human diseases. in Human Heredity 56, 73–82 (2003).

68. Templeton. Epistasis and complex traits. (Oxford University Press, 2000).

69. Mackay, T. F. C. Epistasis and quantitative traits: using model organisms to study gene-gene interactions. Nat. Rev. Genet. 15, 22–33 (2014).

70. Wagner, A. Robustness against mutations in genetic networks of yeast. Nat. Genet. 24, 355–361 (2000).

71. Boone, C., Bussey, H. & Andrews, B. J. Exploring genetic interactions and networks with yeast. Nat. Rev. Genet. 8, 437–449 (2007).

72. Tong, A. H. Y. et al. Global mapping of the yeast genetic interaction network. Science 303, 808–813 (2004).

73. Szappanos, B. et al. An integrated approach to characterize genetic interaction networks in yeast metabolism. Nat. Genet. 43, 656–662 (2011).

74. Moore, J. H. A global view of epistasis. Nature genetics 37, 13–14 (2005).

75. Baryshnikova, A., Costanzo, M., Myers, C. L., Andrews, B. & Boone, C. Genetic interaction networks: toward an understanding of heritability. Annu. Rev. Genomics Hum. Genet. 14, 111–33 (2013).

76. Lehner, B., Crombie, C., Tischler, J., Fortunato, A. & Fraser, A. G. Systematic mapping of genetic interactions in Caenorhabditis elegans identifies common modifiers of diverse signaling pathways. Nat. Genet. 38, 896–903 (2006).

77. Gaertner, B. E., Parmenter, M. D., Rockman, M. V., Kruglyak, L. & Phillips, P. C. More than the sum of its parts: A complex epistatic network underlies natural variation in thermal preference behavior in Caenorhabditis elegans. Genetics 192, 1533–1542 (2012).

78. Byrne, A. B. et al. A global analysis of genetic interactions in Caenorhabditis elegans. J. Biol. 6, 8 (2007).

79. Horn, T. et al. Mapping of signaling networks through synthetic genetic interaction analysis by RNAi. Nat. Methods 8, 341–346 (2011).

80. Huang, W. et al. Inaugural Article: Epistasis dominates the genetic architecture of Drosophila quantitative traits. Proceedings of the National Academy of Sciences 109, 15553–15559 (2012).

81. Lloyd, V., Ramaswami, M. & Krämer, H. Not just pretty eyes: Drosophila eye-colour mutations and lysosomal delivery. Trends in Cell Biology 8, 257–259 (1998).

82. Cheng, Y. et al. Mapping genetic loci that interact with myostatin to affect growth traits. Heredity 107, 565–573 (2011).

83. Hanlon, P. et al. Three-locus and four-locus QTL interactions influence mouse insulin-like growth factor-I. Physiol. Genomics 26, 46–54 (2006).

84. Gale, G. D. et al. A genome-wide panel of congenic mice reveals widespread epistasis of behavior quantitative trait loci. Mol. Psychiatry 14, 631–645 (2009).

85. Rowe, H. C., Hansen, B. G., Halkier, B. A. & Kliebenstein, D. J. Biochemical networks and epistasis shape the Arabidopsis thaliana metabolome. Plant Cell 20, 1199–1216 (2008).

86. Kroymann, J. & Mitchell-Olds, T. Epistasis and balanced polymorphism influencing complex trait variation. Nature 435, 95–98 (2005).

87. Nelson, M. R., Kardia, S. L. R., Ferrell, R. E. & Sing, C. F. A combinatorial partitioning method to identify multilocus genotypic partitions that predict quantitative trait variation. Genome Res. 11, 458–470 (2001).

88. Ritchie, M. D. et al. Multifactor-dimensionality reduction reveals high-order interactions among estrogen-metabolism genes in sporadic breast cancer. Am. J. Hum. Genet. 69, 138–147 (2001).

89. Hemani, G. et al. Detection and replication of epistasis influencing transcription in humans. Nature 508, 249–53 (2014).

90. Sun, Y. V. Integration of biological networks and pathways with genetic association studies. Hum. Genet. 131, 1677–86 (2012).

91. Ritchie, M. D. Using biological knowledge to uncover the mystery in the search for epistasis in genome-wide association studies. Ann. Hum. Genet. 75, 172–182 (2011).

92. Ma, L., Keinan, A. & Clark, A. G. Biological Knowledge-Driven Analysis of Epistasis in Human GWAS with Application to Lipid Traits. Methods Mol. Biol. 1253, 35–45 (2015).

93. Sun, X. et al. Analysis pipeline for the epistasis search - statistical versus biological filtering. Frontiers in Genetics 5, (2014).

94. Thomas, D. C. et al. Use of pathway information in molecular epidemiology. Hum. Genomics 4, 21–42 (2009).

95. Kim, D. et al. Knowledge-driven genomic interactions: an application in ovarian cancer. BioData Min. 7, 20 (2014).

96. Wang, Z., Xu, W., San Lucas, F. A. & Liu, Y. Incorporating prior knowledge into Gene Network Study. Bioinformatics 29, 2633–40 (2013).

97. Bush, W. S. et al. A knowledge-driven interaction analysis reveals potential neurodegenerative mechanism of multiple sclerosis susceptibility. Genes Immun. 12, 335–40 (2011).

98. Ritchie, M. D. Using prior knowledge and genome-wide association to identify pathways involved in multiple sclerosis. Genome Med. 1, 65 (2009).

99. Grady, B. J. et al. Finding unique filter sets in PLATO: a precursor to efficient interaction analysis in GWAS data. Pac. Symp. Biocomput. 315–26 (2010).

100. Hall, M. A. et al. Biology-Driven Gene-Gene Interaction Analysis of Age-Related Cataract in the eMERGE Network. Genet. Epidemiol. 39, 376–84 (2015).

101. Pendergrass, S. A. et al. Next-generation analysis of cataracts: determining knowledge driven gene-gene interactions using biofilter, and gene-environment interactions using the Phenx Toolkit*. Pac. Symp. Biocomput. 495–505 (2015).

102. Turner, S. D. et al. Knowledge-driven multi-locus analysis reveals gene-gene interactions influencing HDL cholesterol level in two independent EMR-linked biobanks. PLoS One 6, e19586 (2011).

103. Pendergrass, S. A. et al. Phenome-Wide Association Study (PheWAS) for Detection of Pleiotropy within the Population Architecture using Genomics and Epidemiology (PAGE) Network. PLoS Genet 9, e1003087 (2013).

104. Matise, T. C. et al. The Next PAGE in understanding complex traits: design for the analysis of Population Architecture Using Genetics and Epidemiology (PAGE) Study. Am. J. Epidemiol. 174, 849–859 (2011).

105. Wolfe, D., Dudek, S., Ritchie, M. D. & Pendergrass, S. A. Visualizing genomic information across chromosomes with PhenoGram. BioData Min. 6, 18 (2013).

106. Pendergrass, S. A. et al. Genomic analyses with biofilter 2.0: knowledge driven filtering, annotation, and model development. BioData Min. 6, 25 (2013).

107. Bush, W. S., Dudek, S. M. & Ritchie, M. D. Biofilter: a knowledge-integration system for the multi-locus analysis of genome-wide association studies. Pac. Symp. Biocomput. 368–379 (2009).

108. Kanehisa, M. & Goto, S. KEGG: kyoto encyclopedia of genes and genomes. Nucleic Acids Res. 28, 27–30 (2000).

109. Ashburner, M. et al. Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat. Genet. 25, 25–29 (2000).

110. Kandasamy, K. et al. NetPath: a public resource of curated signal transduction pathways. Genome Biol. 11, R3 (2010).

111. Smoot, M. E., Ono, K., Ruscheinski, J., Wang, P. L. & Ideker, T. Cytoscape 2.8: new features for data integration and network visualization. Bioinformatics 27, 431–2 (2011).

112. McQuillan, G. M., Pan, Q. & Porter, K. S. Consent for genetic research in a general population: an update on the National Health and Nutrition Examination Survey experience. Genet. Med. Off. J. Am. Coll. Med. Genet. 8, 354–360 (2006).

113. Dumitrescu, L. et al. Genetic determinants of lipid traits in diverse populations from the population architecture using genomics and epidemiology (PAGE) study. PLoS Genet. 7, e1002138 (2011).

114. Dumitrescu, L. et al. Variation in LPA is associated with Lp(a) levels in three populations from the Third National Health and Nutrition Examination Survey. PLoS One 6, e16604 (2011).

115. Keebler, M. E. et al. Association of blood lipids with common DNA sequence variants at 19 genetic loci in the multiethnic United States National Health and Nutrition Examination Survey III. Circ. Cardiovasc. Genet. 2, 238–243 (2009).

116. Aulchenko, Y. S. et al. Loci influencing lipid levels and coronary heart disease risk in 16 European population cohorts. Nat. Genet. 41, 47–55 (2009).

117. Sabatti, C. et al. Genome-wide association analysis of metabolic traits in a birth cohort from a founder population. Nat. Genet. 41, 35–46 (2009).

118. Kathiresan, S. et al. Six new loci associated with blood low-density lipoprotein cholesterol, high-density lipoprotein cholesterol or triglycerides in humans. Nat. Genet. 40, 189–197 (2008).

119. Waterworth, D. M. et al. Genetic variants influencing circulating lipid levels and risk of coronary artery disease. Arterioscler. Thromb. Vasc. Biol. 30, 2264–2276 (2010).

120. Kathiresan, S. et al. Common variants at 30 loci contribute to polygenic dyslipidemia. Nat. Genet. 41, 56–65 (2009).

121. Bolibar, I., von Eckardstein, A., Assmann, G., Thompson, S. & ECAT Angina Pectoris Study Group. European Concerted Action on Thrombosis and Disabilities. Short-term prognostic value of lipid measurements in patients with angina pectoris. The ECAT Angina Pectoris Study Group: European Concerted Action on Thrombosis and Disabilities. Thromb. Haemost. 84, 955–960 (2000).

122. Castelli, W. P. et al. Incidence of coronary heart disease and lipoprotein cholesterol levels. The Framingham Study. JAMA J. Am. Med. Assoc. 256, 2835–2838 (1986).

123. Köttgen, A. et al. Genome-wide association analyses identify 18 new loci associated with serum urate concentrations. Nat. Genet. 45, 145–154 (2013).

124. Karns, R. et al. Genome-wide association of serum uric acid concentration: replication of sequence variants in an island population of the Adriatic coast of Croatia. Ann. Hum. Genet. 76, 121–127 (2012).

125. Kolz, M. et al. Meta-analysis of 28,141 individuals identifies common variants within five new loci that influence uric acid concentrations. PLoS Genet. 5, e1000504 (2009).

126. Dehghan, A. et al. Association of three genetic loci with uric acid concentration and risk of gout: a genome-wide association study. Lancet 372, 1953–1961 (2008).

127. Schunkert, H. et al. Large-scale association analysis identifies 13 new susceptibility loci for coronary artery disease. Nat. Genet. 43, 333–338 (2011).

128. Willer, C. J. et al. Newly identified loci that influence lipid concentrations and risk of coronary artery disease. Nat. Genet. 40, 161–169 (2008).

129. Myocardial Infarction Genetics Consortium et al. Genome-wide association of early-onset myocardial infarction with single nucleotide polymorphisms and copy number variants. Nat. Genet. 41, 334–341 (2009).

130. Pierce, B. L. et al. C-reactive protein, interleukin-6, and prostate cancer risk in men aged 65 years and older. Cancer causes Control CCC 20, 1193–1203 (2009).

131. Walston, J. D. et al. IL-6 gene variation is associated with IL-6 and C-reactive protein levels but not cardiovascular outcomes in the Cardiovascular Health Study. Hum. Genet. 122, 485–494 (2007).

132. Vickers, M. A. et al. Genotype at a promoter polymorphism of the interleukin-6 gene is associated with baseline levels of plasma C-reactive protein. Cardiovasc. Res. 53, 1029–1034 (2002).

133. Richards, J. B. et al. Bone mineral density, osteoporosis, and osteoporotic fractures: a genome-wide association study. Lancet 371, 1505–1512 (2008).

134. Franceschini, N. et al. Discovery and fine mapping of serum protein loci through transethnic meta-analysis. Am. J. Hum. Genet. 91, 744–753 (2012).

135. Osman, W. et al. Association of common variants in TNFRSF13B, TNFSF13, and ANXA3 with serum levels of non-albumin protein and immunoglobulin isotypes in Japanese. PLoS One 7, e32683 (2012).

136. Gieger, C. et al. New gene functions in megakaryopoiesis and platelet formation. Nature 480, 201–208 (2011).

137. Middelberg, R. P. S. et al. Genetic variants in LPL, OASL and TOMM40/APOE-C1-C2-C4 genes are associated with multiple cardiovascular-related traits. BMC Med. Genet. 12, 123 (2011).

138. Dehghan, A. et al. Meta-analysis of genome-wide association studies in >80 000 subjects identifies multiple loci for C-reactive protein levels. Circulation 123, 731–738 (2011).

139. Teslovich, T. M. et al. Biological, clinical and population relevance of 95 loci for blood lipids. Nature 466, 707–713 (2010).

140. Köttgen, A. et al. New loci associated with kidney function and chronic kidney disease. Nat. Genet. 42, 376–384 (2010).

141. Chambers, J. C. et al. Genome-wide association study identifies loci influencing concentrations of liver enzymes in plasma. Nat. Genet. 43, 1131–1138 (2011).

142. Lemaitre, R. N. et al. Genetic loci associated with plasma phospholipid n-3 fatty acids: a meta-analysis of genome-wide association studies from the CHARGE Consortium. PLoS Genet. 7, e1002193 (2011).

143. Eijgelsheim, M. et al. Genome-wide association analysis identifies multiple loci related to resting heart rate. Hum. Mol. Genet. 19, 3885–3894 (2010).

144. Illig, T. et al. A genome-wide perspective of genetic variation in human metabolism. Nat. Genet. 42, 137–141 (2010).

145. Ghio, A. J., Ford, E. S., Kennedy, T. P. & Hoidal, J. R. The association between serum ferritin and uric acid in humans. Free Radic. Res. 39, 337–342 (2005).

146. Reiner, A. P. et al. Genome-wide association study of white blood cell count in 16,388 African Americans: the continental origins and genetic epidemiology network (COGENT). PLoS Genet. 7, e1002108 (2011).

147. Reich, D. et al. Reduced neutrophil count in people of African descent is due to a regulatory variant in the Duffy antigen receptor for chemokines gene. PLoS Genet. 5, e1000360 (2009).

148. Rowe, J. W., Tobin, J. D., Rosa, R. M. & Andres, R. Effect of experimental potassium deficiency on glucose and insulin metabolism. Metabolism. 29, 498–502 (1980).

149. Bennink, H. J. & Schreurs, W. H. Improvement of oral glucose tolerance in gestational diabetes by pyridoxine. Br. Med. J. 3, 13–5 (1975).

150. Spellacy, W. N., Buhi, W. C. & Birk, S. A. Vitamin B6 treatment of gestational diabetes mellitus: studies of blood glucose and plasma insulin. Am. J. Obstet. Gynecol. 127, 599–602 (1977).

151. Semmler, A. et al. Plasma folate levels are associated with the lipoprotein profile: a retrospective database analysis. Nutr. J. 9, 31 (2010).

152. Baecke, J. A. H., Burema, J. & Frijters, J. E. R. A short questionnaire for the measurement of habitual physical activity in epidemiological studies. Am. J. Clin. Nutr. 36, 936–942 (1982).

153. Subar, A. F. et al. Comparative validation of the Block, Willett, and National Cancer Institute food frequency questionnaires: the Eating at America’s Table Study. Am. J. Epidemiol. 154, 1089–99 (2001).

154. Thompson, F. E. et al. Cognitive research enhances accuracy of food frequency questionnaire reports: results of an experimental validation study. J. Am. Diet. Assoc. 102, 212–25 (2002).

155. Kohane, I. S. Using electronic health records to drive discovery in disease genomics. Nat. Rev. Genet. 12, 417–28 (2011).

156. McCarty, C. A. et al. The eMERGE Network: a consortium of biorepositories linked to electronic medical records data for conducting genomic studies. BMC Med. Genomics 4, 13 (2011).

157. McCarty, C. A., Wilke, R. A., Giampietro, P. F., Wesbrook, S. D. & Caldwell, M. D. Marshfield Clinic Personalized Medicine Research Project (PMRP): design, methods and recruitment for a large population-based biobank. Per. Med. 2, 49–79 (2005).

158. Centers for Disease Control and Prevention (CDC). National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Questionnaire (or Examination Protocol, or Laboratory Protocol). Department of Health and Human Services, Centers for Disease Control and Prevention. Hyattsville, MD.

159. Kho, A. N. et al. Use of diverse electronic medical record systems to identify genetic risk for type 2 diabetes within a genome-wide association study. J. Am. Med. Inform. Assoc. 19, 212–8

160. Haiman, C. A. et al. Consistent directions of effect for established type 2 diabetes risk variants across populations: the population architecture using Genomics and Epidemiology (PAGE) Consortium. Diabetes 61, 1642–7 (2012).

161. Zuvich, R. L. et al. Pitfalls of merging GWAS data: lessons learned in the eMERGE network and quality control procedures to maintain high data quality. Genet. Epidemiol. 35, 887–98 (2011).

162. Van Dam, R. M., Willett, W. C., Manson, J. E. & Hu, F. B. Coffee, caffeine, and risk of type 2 diabetes: a prospective cohort study in younger and middle-aged U.S. women. Diabetes Care 29, 398–403 (2006).

163. Carlsson, S., Hammar, N., Grill, V. & Kaprio, J. Alcohol consumption and the incidence of type 2 diabetes: a 20-year follow-up of the Finnish twin cohort study. Diabetes Care 26, 2785–90 (2003).

164. Pietraszek, A., Gregersen, S. & Hermansen, K. Alcohol and type 2 diabetes. A review. Nutr. Metab. Cardiovasc. Dis. 20, 366–75 (2010).

165. Standl, E. & Schnell, O. Alpha-glucosidase inhibitors 2012 - cardiovascular considerations and trial evaluation. Diabetes and Vascular Disease Research 9, 163–169 (2012).

166. Willi, C., Bodenmann, P., Ghali, W. A., Faris, P. D. & Cornuz, J. Active smoking and the risk of type 2 diabetes: a systematic review and meta-analysis. JAMA 298, 2654–64 (2007).

167. Yeh, H.-C., Duncan, B. B., Schmidt, M. I., Wang, N.-Y. & Brancati, F. L. Smoking, smoking cessation, and risk for type 2 diabetes mellitus: a cohort study. Ann. Intern. Med. 152, 10–7 (2010).

168. Xie, X., Liu, Q., Wu, J. & Wakui, M. Impact of cigarette smoking in type 2 diabetes development. Acta Pharmacol. Sin. 30, 784–7 (2009).

169. Hu, G. et al. Occupational, commuting, and leisure-time physical activity in relation to risk for Type 2 diabetes in middle-aged Finnish men and women. Diabetologia 46, 322–9 (2003).

170. Laaksonen, D. E. et al. Physical activity in the prevention of type 2 diabetes: the Finnish diabetes prevention study. Diabetes 54, 158–65 (2005).

171. Helmrich, S. P., Ragland, D. R., Leung, R. W. & Paffenbarger, R. S. Physical activity and reduced occurrence of non-insulin-dependent diabetes mellitus. N. Engl. J. Med. 325, 147–52 (1991).

172. Nielsen, J. V & Joensson, E. A. Low-carbohydrate diet in type 2 diabetes: stable improvement of bodyweight and glycemic control during 44 months follow-up. Nutr. Metab. (Lond). 5, 14 (2008).

173. Black, A. & Wood, J. Vision and falls. Clin. Exp. Optom. 88, 212–222 (2005).

174. Ellwein, L. B. & Urato, C. J. Use of eye care and associated charges among the Medicare population: 1991-1998. Arch. Ophthalmol. (Chicago, Ill. 1960) 120, 804–11 (2002).

175. Hejtmancik, J. F. & Kantorow, M. Molecular genetics of age-related cataract. Exp. Eye Res. 79, 3–9 (2004).

176. Ritchie, M. D. et al. Electronic medical records and genomics (eMERGE) network exploration in cataract: several new potential susceptibility loci. Mol. Vis. 20, 1281–95 (2014).

177. Asbell, P. A. et al. Age-related cataract. Lancet (London, England) 365, 599–609

178. Chong, C. C. W., Stump, R. J. W., Lovicu, F. J. & McAvoy, J. W. TGFbeta promotes Wnt expression during cataract development. Exp. Eye Res. 88, 307–13 (2009).

179. Martinez, G. & de Iongh, R. U. The lens epithelium in ocular health and disease. Int. J. Biochem. Cell Biol. 42, 1945–63 (2010).

180. Cho, H. J., Baek, K. E., Saika, S., Jeong, M.-J. & Yoo, J. Snail is required for transforming growth factor-beta-induced epithelial-mesenchymal transition by activating PI3 kinase/Akt signal pathway. Biochem. Biophys. Res. Commun. 353, 337–43 (2007).

181. Bao, X., Song, H., Chen, Z. & Tang, X. Wnt3a promotes epithelial-mesenchymal transition, migration, and proliferation of lens epithelial cells. Mol. Vis. 18, 1983–90 (2012).

182. Gottesman, O. et al. The Electronic Medical Records and Genomics (eMERGE) Network: past, present, and future. Genet. Med. 15, 761–71 (2013).

183. Crawford, D. C. et al. eMERGEing progress in genomics-the first seven years. Front. Genet. 5, 184 (2014).

184. Peissig, P. L. et al. Importance of multi-modal approaches to effectively identify cataract cases from electronic health records. J. Am. Med. Inform. Assoc. 19, 225–34

185. McCarty, C. A., Chapman-Stone, D., Derfus, T., Giampietro, P. F. & Fost, N. Community consultation and communication for a population-based DNA biobank: the Marshfield clinic personalized medicine research project. Am. J. Med. Genet. A 146A, 3026–33 (2008).

186. Roden, D. M. et al. Development of a large-scale de-identified DNA biobank to enable personalized medicine. Clin. Pharmacol. Ther. 84, 362–9 (2008).

187. Abecasis, G. R. et al. An integrated map of genetic variation from 1,092 human genomes. Nature 491, 56–65 (2012).

188. Verma, S. S. et al. Imputation and quality control steps for combining multiple genome-wide datasets. Front. Genet. 5, 370 (2014).

189. Howie, B. N., Donnelly, P. & Marchini, J. A flexible and accurate genotype imputation method for the next generation of genome-wide association studies. PLoS Genet. 5, e1000529 (2009).

190. Delaneau, O., Zagury, J.-F. & Marchini, J. Improved whole-chromosome phasing for disease and population genetic studies. Nat. Methods 10, 5–6 (2013).

191. Purcell, S. et al. PLINK: a tool set for whole-genome association and population-based linkage analyses. Am. J. Hum. Genet. 81, 559–75 (2007).

192. Ogata, H. et al. KEGG: Kyoto Encyclopedia of Genes and Genomes. Nucleic Acids Res. 27, 29–34 (1999).

193. Matthews, L. et al. Reactome knowledgebase of human biological pathways and processes. Nucleic Acids Res. 37, D619–22 (2009).

194. Punta, M. et al. The Pfam protein families database. Nucleic Acids Res. 40, D290–301 (2012).

195. Stark, C. et al. The BioGRID Interaction Database: 2011 update. Nucleic Acids Res. 39, D698–704 (2011).

196. Licata, L. et al. MINT, the molecular interaction database: 2012 update. Nucleic Acids Res. 40, D857–61 (2012).

197. Sherry, S. T. et al. dbSNP: the NCBI database of genetic variation. Nucleic Acids Res. 29, 308–11 (2001).

198. Price, A. L. et al. Principal components analysis corrects for stratification in genome-wide association studies. Nat. Genet. 38, 904–9 (2006).

199. Saito, R. et al. A travel guide to Cytoscape plugins. Nat. Methods 9, 1069–76 (2012).

200. Liu, X., Yu, X., Zack, D. J., Zhu, H. & Qian, J. TiGER: a database for tissue-specific gene expression and regulation. BMC Bioinformatics 9, 271 (2008).

201. Yu, X., Lin, J., Zack, D. J. & Qian, J. Computational analysis of tissue-specific combinatorial gene regulation: predicting interaction between transcription factors in human tissues. Nucleic Acids Res. 34, 4925–36 (2006).

202. Goodenough, D. A. The crystalline lens. A system networked by gap junctional intercellular communication. Semin. Cell Biol. 3, 49–58 (1992).

203. Benedek, G. B. Theory of transparency of the eye. Appl. Opt. 10, 459–73 (1971).

204. Donaldson, P., Kistler, J. & Mathias, R. T. Molecular solutions to mammalian lens transparency. News Physiol. Sci. 16, 118–23 (2001).

205. Cooper, M. A. et al. Loss of ephrin-A5 function disrupts lens fiber cell packing and leads to cataract. Proc. Natl. Acad. Sci. U. S. A. 105, 16620–5 (2008).

206. Gong, X. et al. Disruption of alpha3 connexin gene leads to proteolysis and cataractogenesis in mice. Cell 91, 833–43 (1997).

207. White, T. W., Goodenough, D. A. & Paul, D. L. Targeted ablation of connexin50 in mice results in microphthalmia and zonular pulverulent cataracts. J. Cell Biol. 143, 815–25 (1998).

208. Wei, C.-J., Xu, X. & Lo, C. W. Connexins and cell signaling in development and disease. Annu. Rev. Cell Dev. Biol. 20, 811–38 (2004).

209. White, T. W. & Paul, D. L. Genetic diseases and gene knockouts reveal diverse connexin functions. Annu. Rev. Physiol. 61, 283–310 (1999).

210. Pontoriero, G. F. et al. Co-operative roles for E-cadherin and N-cadherin during lens vesicle separation and lens epithelial cell survival. Dev. Biol. 326, 403–17 (2009).

211. Perez-Moreno, M., Jamora, C. & Fuchs, E. Sticky business: orchestrating cellular signals at adherens junctions. Cell 112, 535–48 (2003).

212. Nusse, R. Wnt signaling in disease and in development. Cell Res. 15, 28–32 (2005).

213. Logan, C. Y. & Nusse, R. The Wnt signaling pathway in development and disease. Annu. Rev. Cell Dev. Biol. 20, 781–810 (2004).

214. Apple, D. J. et al. Posterior capsule opacification. Surv. Ophthalmol. 37, 73–116

215. Awasthi, N., Guo, S. & Wagner, B. J. Posterior capsular opacification: a problem reduced but not yet eradicated. Arch. Ophthalmol. (Chicago, Ill. 1960) 127, 555–62 (2009).

216. Yao, K., Ye, P. P., Tan, J., Tang, X. J. & Shen Tu, X. C. Involvement of PI3K/Akt pathway in TGF-beta2-mediated epithelial mesenchymal transition in human lens epithelial cells. Ophthalmic Res. 40, 69–76 (2008).

217. Rowan, S. et al. Notch signaling regulates growth and differentiation in the mammalian lens. Dev. Biol. 321, 111–22 (2008).

218. Huang, W.-R., Fan, X.-X. & Tang, X. SiRNA targeting EGFR effectively prevents posterior capsular opacification after cataract surgery. Mol. Vis. 17, 2349–55 (2011).

219. Sun, C., Zhao, M. & Li, X. CFB/C2 gene polymorphisms and risk of age-related macular degeneration: a systematic review and meta-analysis. Curr. Eye Res. 37, 259–71 (2012).

220. Tyler, A. L., Asselbergs, F. W., Williams, S. M. & Moore, J. H. Shadows of complexity: what biological networks reveal about epistasis and pleiotropy. Bioessays 31, 220–7 (2009).

221. Roberts, J. E. Ultraviolet radiation as a risk factor for cataract and macular degeneration. Eye Contact Lens 37, 246–249 (2011).

APPENDIX

Appendix Table 2.1. 80 SNPs Chosen for NHANES PheWAS. For each SNP, the previously published trait association (with PubMed ID), genotyping method, context, the gene(s) the SNP is in or nearest, distance from nearest gene, regulatory information, and whether SNP has previously exhibited pleiotropy in the GWAS catalog are included.

| SNP | Reason Chosen for Genotyping | Previously Published Phenotype (NHGRI GWAS Catalog) | Publications Used in Original Choice to Genotype (PubMed ID) | Genotyping Method | Context | In Gene? | What Gene (BF) | Nearest gene | How far away? (base pairs) | Regulatory? | GWAS Catalog Pleiotropy? |
|---|---|---|---|---|---|---|---|---|---|---|---|
| rs10401969 | LDL cholesterol | LDL cholesterol | PMID:19060906 | Sequenom | intron | Yes | SUGP1 | N/A | N/A | 1f | yes |
| rs10490924 | Age-related macular degeneration | Age-related macular degeneration | PMID:16642439, PMID:16174643 | Sequenom | missense | Yes | ARMS2 | N/A | N/A | no | no |
| rs10503669 | HDL cholesterol | HDL cholesterol | PMID:18852197 | Sequenom | Intergenic | No | N/A | LPL | 22920 | 6 | yes |
| rs1061170 | Age-related macular degeneration | Age-related macular degeneration | PMID:15761122, PMID:16174643 | Sequenom | missense | Yes | CFH | N/A | N/A | no | no |
| rs10757278 | Myocardial infarction | Myocardial infarction | PMID:17478679 | Sequenom | Intergenic | No | N/A | CDKN2B-AS1 | 3381 | 4 | no |
| rs10778213 | C-reactive protein | C-reactive protein | PMID:18852197 | Sequenom | Intergenic | No | N/A | PAH | 183770 | 5 | no |
| rs10811661 | Type 2 diabetes | Type 2 diabetes | PMID:18176561 | Sequenom | Intergenic | No | N/A | CDKN2B-AS1 | 12998 | no | yes |
| rs10889353 | Triglycerides | Cholesterol, total | PMID:19060906 | Sequenom | intron | Yes | DOCK7 | N/A | N/A | 1f | yes |
| rs10923931 | Type 2 diabetes | Type 2 diabetes | PMID:18852197 | Sequenom | intron | Yes | NOTCH2 | N/A | N/A | no | no |
| rs10938397 | Body mass index | Body mass index | PMID:19079261 | Sequenom | Intergenic | No | N/A | GNPDA2 | 453876 | no | no |
| rs11084753 | Body mass index | Body mass index | PMID:19079261 | Sequenom | Intergenic | No | N/A | KCTD15 | 15471 | 6 | no |
| rs11206510 | LDL cholesterol | Coronary heart disease | PMID:18852197 | Sequenom | Intergenic | No | N/A | PCSK9 | 9110 | no | yes |
| rs1205 | C-reactive Protein | | PMID:17101857 | Taqman (Already in CDC NHANES III database from PMID:17101857); Sequenom (NHANES 1999-2002) | UTR-3 | Yes | CRP | N/A | N/A | 4 | no |
| rs12286037 | Triglycerides | Metabolic syndrome (bivariate traits) | PMID:18852197 | Sequenom | intron | Yes | ZNF259 | N/A | N/A | no | yes |
| rs12596776 | HDL cholesterol | | PMID:18193043 | Sequenom | intron | Yes | SLC12A3 | N/A | N/A | 2b | no |
| rs1260326 | Triglycerides | Metabolite levels | PMID:19060906 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | missense | Yes | GCKR | N/A | N/A | no | yes |
| rs12678919 | HDL cholesterol | HDL cholesterol | PMID:19060906 | Sequenom | Intergenic | No | N/A | LPL | 19452 | no | yes |
| rs12695382 | LDL cholesterol | | PMID:18193043 | Sequenom | intron | Yes | B4GALT4 | N/A | N/A | no | no |
| rs12740374 | LDL cholesterol | Coronary heart disease | PMID:19060906 | Sequenom | UTR-3 | Yes | CELSR2 | N/A | N/A | 4 | yes |
| rs1323432 | HDL cholesterol | | PMID:18193043 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | intron | Yes | GRIN3A | N/A | N/A | 6 | no |
| rs13266634 | Type 2 diabetes | Type 2 diabetes and other traits | PMID:18852197 | Sequenom | cds-synon | Yes | SLC30A8 | N/A | N/A | 4 | yes |
| rs1333049 | Coronary artery disease | Coronary artery calcification | PMID:17634449 | Sequenom | Intergenic | No | N/A | CDKN2B-AS1 | 4407 | no | no |
| rs1501908 | LDL cholesterol | LDL cholesterol | PMID:19060906 | Sequenom | Intergenic | No | N/A | TIMD4 | 7903 | no | no |
| rs1529729 | LDL cholesterol | | PMID:18193044 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | intron | Yes | SMARCA4 | N/A | N/A | 4 | no |
| rs1566439 | HDL cholesterol | | PMID:18193043 | Sequenom | Intergenic | No | N/A | CETP | 6906 | no | no |
| rs16996148 | Triglycerides | LDL cholesterol | PMID:18193043, PMID:18193044 | Sequenom | Intergenic | No | N/A | CILP2 | 1004 | 1f | yes |
| rs17145738 | Triglycerides | HDL cholesterol | PMID:18193043 | Sequenom | nearGene-3 | No | N/A | TBL2 | 400 | 5 | yes |
| rs17216525 | Triglycerides | Triglycerides | PMID:19060906 | Sequenom | Intergenic | No | N/A | CILP2 | 4752 | 5 | no |
| rs174547 | HDL-C | Lipid metabolism phenotypes | PMID:19060906 | Sequenom | intron | Yes | FADS1 | N/A | N/A | 6 | yes |
| rs1748195 | Triglycerides | Triglycerides | PMID:18852197 | Sequenom | intron | Yes | DOCK7 | N/A | N/A | 6 | no |
| rs17782313 | Body mass index | Height | PMID:18454148 | Sequenom | Intergenic | No | N/A | MC4R | 1875467 | no | yes |
| rs1800588 | HDL-C | HDL cholesterol | PMID:18193044, PMID:18354102 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | nearGene-5 | No | N/A | LIPC | 500 | 4 | no |
| rs1800795 | C-reactive protein | | PMID:17851695 | Sequenom | nearGene-5 | Yes | LOC541472 | N/A | N/A | 4 | no |
| rs1800961 | HDL-C | C-reactive protein | PMID:19060906 | Sequenom | missense | Yes | HNF4A | N/A | N/A | 3a | yes |
| rs1883025 | HDL-C | Metabolic syndrome | PMID:19060906 | Sequenom | intron | Yes | ABCA1 | N/A | N/A | 5 | yes |
| rs2144300 | HDL-C | HDL cholesterol | PMID:18852197 | Sequenom | intron | Yes | GALNT2 | N/A | N/A | 4 | yes |
| rs2197089 | HDL-C | Metabolic syndrome (bivariate traits) | PMID:18193043 | Sequenom | Intergenic | No | N/A | LPL | 1603 | 6 | no |
| rs2228671 | LDL-C | Cholesterol, total | PMID:18262040 | Illumina GoldenGate (Already in CDC NHANES III database from Dr. Crawford); Sequenom (NHANES 1999-2002) | cds-synon | Yes | LDLR | N/A | N/A | 4 | yes |
| rs2231142 | Gout | Uric acid levels | PMID:18834626, PMID:19503597 | Sequenom | STOP-GAIN | Yes | ABCG2 | N/A | N/A | 5 | no |
| rs2237895 | Type 2 diabetes | Type 2 diabetes | PMID:18711366 | Sequenom | intron | Yes | KCNQ1 | N/A | N/A | 5 | no |
| rs2338104 | HDL-C | HDL cholesterol | PMID:18852197 | Sequenom | intron | Yes | KCTD10 | N/A | N/A | 6 | no |
| rs2383208 | Myocardial infarction | Type 2 diabetes | PMID:17478679 | Sequenom | Intergenic | No | N/A | CDKN2B-AS1 | 10980 | no | no |
| rs2650000 | LDL-C | LDL cholesterol | PMID:19060906 | Sequenom | Intergenic | No | N/A | HNF1A | 27587 | 1f | yes |
| rs28927680 | Triglycerides | Triglycerides | PMID:18193044 | Sequenom | UTR-3 | Yes | BUD13 | N/A | N/A | no | no |
| rs3135506 | HDL-C | | PMID:18193043 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | missense | Yes | APOA5 | N/A | N/A | 2b | no |
| rs328 | HDL-C | Triglycerides | PMID:18354102 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | STOP-GAIN | Yes | LPL | N/A | N/A | 5 | yes |
| rs3736228 | Bone mineral density | Bone mineral density | PMID:18349089, PMID:18455228 | Sequenom | missense | Yes | LRP5 | N/A | N/A | 1f | no |
| rs3764261 | HDL-C | Lipid metabolism phenotypes | PMID:18852197 | Sequenom | Intergenic | No | N/A | CETP | 2511 | no | yes |
| rs3890182 | HDL-C | HDL cholesterol | PMID:18193044, PMID:18354102 | Sequenom (Already in CDC NHANES III database from PMID:20031591); Sequenom (NHANES 1999-2002) | intron | Yes | ABCA1 | N/A | N/A | 5 | no |
| rs3905000 | HDL-C | MRI atrophy measures | PMID:19060911 | Sequenom | intron | Yes | ABCA1 | N/A | N/A | no | yes |
| rs4149268 | HDL-C | HDL cholesterol | PMID:18852197 | Sequenom | intron | Yes | ABCA1 | N/A | N/A | 6 | no |
| rs4355801 | Bone mineral density | Bone mineral density | PMID:18455228 | Sequenom | Intergenic | No | N/A | TNFRSF11B | 11923 | no | no |
| rs4420638 | LDL-C | Cognitive decline | PMID:18193043, PMID:17554300 | Sequenom | nearGene-3 | No | N/A | APOC1 | 340 | no | yes |
| rs4607103 | Type 2 diabetes | Type 2 diabetes | PMID:18852197 | Sequenom | Intergenic | Yes | ADAMTS9-AS2 | N/A | N/A | 6 | no |
| rs4712523 | Type 2 diabetes | Type 2 diabetes and other traits | PMID:18852197 | Sequenom | intron | Yes | CDKAL1 | N/A | N/A | 4 | no |
| rs471364 | HDL-C | HDL cholesterol | PMID:19060906 | Sequenom | intron | Yes | TTC39B | N/A | N/A | 1f | no |
| rs4939883 | HDL-C | Cholesterol, total | PMID:19060906 | Sequenom | Intergenic | No | N/A | LIPG | 47936 | 5 | yes |
| rs515135 | LDL-C | LDL cholesterol | PMID:19060906 | Sequenom | Intergenic | No | N/A | APOB | 19112 | no | no |
| rs560887 | Glucose | Metabolic syndrome | PMID:18451265 | Sequenom | intron | Yes | G6PC2 | N/A | N/A | no | yes |
| rs562338 | LDL-C | LDL cholesterol | PMID:18852197 | Sequenom | Intergenic | No | N/A | APOB | 21376 | no | no |
| rs6102059 | LDL-C | LDL cholesterol | PMID:19060906 | Sequenom | Intergenic | No | N/A | MAFB | 85371 | no | no |
| rs646776 | LDL-C | Coronary heart disease | PMID:18262040 | Sequenom | nearGene-3 | No | N/A | CELSR2 | 152 | 1f | yes |
| rs6544713 | LDL-C | LDL cholesterol | PMID:19060906 | Sequenom | intron | Yes | ABCG8 | N/A | N/A | no | no |
| rs6548238 | Body mass index | Body mass index | PMID:19079261 | Sequenom | Intergenic | No | N/A | TMEM18 | 33068 | 4 | no |
| rs6586891 | HDL-C | | PMID:18193043 | Sequenom | Intergenic | No | N/A | LPL | 89828 | no | no |
| rs673548 | Triglycerides | Metabolic syndrome | PMID:19060911 | Illumina GoldenGate (Already in CDC NHANES III database from Dr. Crawford); Sequenom (NHANES 1999-2002) | intron | Yes | APOB | N/A | N/A | 5 | no |
| rs6756629 | LDL-C | Cholesterol, total | PMID:19060911 | Sequenom | missense; nearGene-5 | Yes | ABCG5 | N/A | N/A | no | yes |
| rs6855911 | Uric acid | Urate levels | PMID:17997608 | Sequenom | intron | Yes | SLC2A9 | N/A | N/A | no | no |
| rs693 | LDL-C | Cholesterol, total | PMID:18193044, PMID:18262040 | Illumina GoldenGate (Already in CDC NHANES III database from Dr. Crawford); Sequenom (NHANES 1999-2002) | cds-synon | Yes | APOB | N/A | N/A | 5 | no |
| rs714052 | Triglycerides | Hypertriglyceridemia | PMID:19060906 | Sequenom | intron | Yes | BAZ1B | N/A | N/A | 6 | no |
| rs7442295 | Uric acid | Urate levels | PMID:18179892, PMID:17997608, PMID:18327256 | Sequenom | intron | Yes | SLC2A9 | N/A | N/A | no | no |
| rs7498665 | Body mass index | Body mass index | PMID:19079261 | Sequenom | missense | Yes | SH2B1 | N/A | N/A | 1f | no |
| rs7557067 | Triglycerides | Triglycerides | PMID:19060906 | Sequenom | Intergenic | No | N/A | APOB | 16090 | 4 | no |
| rs7566605 | Body mass index | | PMID:17465681 | Sequenom | Intergenic | No | N/A | INSIG2 | 10025 | no | no |
| rs7578597 | Type 2 diabetes | Type 2 diabetes | PMID:18852197 | Sequenom | missense | Yes | THADA | N/A | N/A | 6 | no |
| rs7679 | HDL-C | HDL cholesterol | PMID:19060906 | Sequenom | UTR-3 | Yes | PCIF1 | N/A | N/A | 1f | yes |
| rs780094 | Triglycerides | Metabolic syndrome | PMID:18678614 | Sequenom | intron | Yes | GCKR | N/A | N/A | 3a | yes |
| rs7961581 | Type 2 diabetes | Type 2 diabetes | PMID:18852197 | Sequenom | Intergenic | No | N/A | TSPAN8 | 111323 | 5 | no |
| rs864745 | Type 2 diabetes | Type 2 diabetes | PMID:18852197 | Sequenom | intron | Yes | JAZF1 | N/A | N/A | no | yes |
| rs964184 | HDL-C | Metabolic syndrome | PMID:19060906 | Sequenom | nearGene-3 | No | N/A | ZNF259 | 359 | 1f | yes |

Appendix Table 2.2. PheWAS replication of associations from the GWAS Catalog.

Previously Reported Closest Phenotype (NHGRI Race/ Sample Gene(s) SNP GWAS Catalog) Pubmed ID Phenotype Class Ethinicity Study Phenotype Description P-value Beta Size (ln + 1)TGP:Triglyceride (mg/dL), 0.00010617, 0.10505(0.02706), 2572, TGP:Triglyceride (mg/dL), (ln + 0.000321933, 20.32377(5.64257), 2572, 1)TGPSI:Triglyceride (mmol/L), 7.75e-05, 0.06632(0.01675), 2572, TGPSI:Triglyceride (mmol/L), (ln + 0.000320775, 0.22952(0.06371), 2572, 1)TGP:Serumtriglycerides(mg/dL), 0.00010617, 0.10505(0.02706), 2572, TGP:Serumtriglycerides(mg/dL), (ln + 0.000321933, 20.32377(5.64257), 2572, 1)TGPSI:Serumtriglycerides:SI(mmol/L), 7.75e-05, 0.06632(0.01675), 2572, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.000320775 0.22952(0.06371) 2572 (ln + 1)LBDSTRSI:Triglycerides 0.05305(0.01277), 3942, (mmol/L), LBDSTRSI:Triglycerides 0.18976(0.04861), 3942, LPL - Triglycerides, HDL 21909109, (mmol/L), (ln + 1)LBDTRSI:Triglyceride 3.34e-05, 9.65e- 0.0507(0.01796), 1943, rs10503669 Triglycerides NHW RPL30P9 cholesterol 18193043 (mmol/L), (ln + 1)LBXSTR:Triglycerides 05, 0.00481151, 0.0818(0.02071), 3942, (mg/dL), LBXSTR:Triglycerides (mg/dL), 7.97e-05, 9.64e- 16.808(4.30595), 3942, NHANES 9902 (ln + 1)LBXTR:Triglyceride (mg/dL) 05, 0.00631701 0.0785(0.02871) 1943 (ln + 1)TGP:Triglyceride (mg/dL), 4515, TGP:Triglyceride (mg/dL), (ln + 4515, 1)TGPSI:Triglyceride (mmol/L), 4515, TGPSI:Triglyceride (mmol/L), (ln + 3.37e-06, 2.11e- 4515, 1)TGP:Serumtriglycerides(mg/dL), 05, 1.86e-06, 0.09185, 18.6425, 4515, TGP:Serumtriglycerides(mg/dL), (ln + 2.1e-05, 3.37e- 0.05858, 0.21052, 4515, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 06, 2.11e-05, 0.09185, 18.6425, 4515, combined TGPSI:Serumtriglycerides:SI(mmol/L) 1.86e-06, 2.1e-05 0.05858, 0.21052 4515 (ln + 1)TGP:Triglyceride (mg/dL), (ln + 0.000822449, 0.09185(0.02741), 1826, 1)TGPSI:Triglyceride (mmol/L), (ln + 0.001541255, 0.05459(0.01721), 1826, 1)TGP:Serumtriglycerides(mg/dL), (ln + 0.000822449, 0.09185(0.02741), 1826, NHANES III 
1)TGPSI:Serumtriglycerides:SI(mmol/L) 0.001541255 0.05459(0.01721) 1820 (ln + 1)LBDSTRSI:Triglycerides 0.000846825, 0.05916(0.0177), (mmol/L), LBDSTRSI:Triglycerides 0.000125534, 0.29322(0.0763), 1862, (mmol/L), (ln + 1)LBDTRSI:Triglyceride 0.000506543, 0.0832(0.02384), 1862, Lipoprotein-associated (mmol/L), LBDTRSI:Triglyceride 3.36e-05, 0.41176(0.0988), 1862, phospholipase A2 (mmol/L), (ln + 1)LBXSTR:Triglycerides 0.00147989, 0.0889(0.02793), 1862, activity and mass, 20442857, (mg/dL), LBXSTR:Triglycerides (mg/dL), 0.000125505, 25.9721(6.75788), 1862, ZNF259 rs12286037 Metabolic syndrome 21386085, Triglycerides MA (ln + 1)LBXTR:Triglyceride (mg/dL), 0.0017486, 0.11678(0.0372), 1892, (bivariate traits), 18193043 NHANES 9902 LBXTR:Triglyceride (mg/dL) 3.41e-05 36.4459(8.75118) 936, 936, Triglycerides(bivariate traits), Triglycerides (ln + 1)TGP:Triglyceride (mg/dL), 2762, TGP:Triglyceride (mg/dL), (ln + 2762, 1)TGPSI:Triglyceride (mmol/L), 4.05e-06, 6.62e- 2762, TGPSI:Triglyceride (mmol/L), (ln + 06, 3.12e-06, 2762, 1)TGP:Serumtriglycerides(mg/dL), 6.55e-06, 4.05e- 0.1023, 21.7428, 2762, TGP:Serumtriglycerides(mg/dL), (ln + 06, 6.62e-06, 0.06539, 0.2456, 2762, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 3.12e-06, 6.55e- 0.1023, 21.7428, 2762, combined TGPSI:Serumtriglycerides:SI(mmol/L) 06 0.06539, 0.2456 2762 (ln + 1)TGP:Triglyceride (mg/dL), 0.13875(0.03122), 2441, Lipoprotein-associated TGP:Triglyceride (mg/dL), (ln + 31.46352(6.46084), 2441, phospholipase A2 1)TGPSI:Triglyceride (mmol/L), 9.2e-06, 1.19e- 0.08989(0.0193), 2441, 20442857, activity and mass, TGPSI:Triglyceride (mmol/L), (ln + 06, 3.39e-06, 0.35519(0.07295), 2441, ZNF259 rs12286037 21386085, Triglycerides NHW Metabolic syndrome 1)TGP:Serumtriglycerides(mg/dL), 1.19e-06, 9.2e- 0.13875(0.03122), 2441, 18193043 (bivariate traits), TGP:Serumtriglycerides(mg/dL), (ln + 06, 1.19e-06, 31.46352(6.46084), 2441, Triglycerides 1)TGPSI:Serumtriglycerides:SI(mmol/L), 3.39e-06, 1.19e- 0.08989(0.0193), 2441, 
NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 06 0.35519(0.07295) 2441

153 (ln + 1)LBDSTRSI:Triglycerides 0.07663(0.01459), 3969, (mmol/L), LBDSTRSI:Triglycerides 0.27459(0.05573), 3969, (mmol/L), (ln + 1)LBDTRSI:Triglyceride 1.59e-07, 8.67e- 0.06539(0.02085), 1958, (mmol/L), (ln + 1)LBXSTR:Triglycerides 07, 0.00173261, 0.11642(0.02364), 3969, (mg/dL), LBXSTR:Triglycerides (mg/dL), 8.82e-07, 8.67e- 24.3218(4.93593), 3969, NHANES 9902 (ln + 1)LBXTR:Triglyceride (mg/dL) 07, 0.00234567 0.10136(0.03327) 1958 (ln + 1)TGP:Triglyceride (mg/dL), 4399, TGP:Triglyceride (mg/dL), (ln + 4399, 1)TGPSI:Triglyceride (mmol/L), 4399, TGPSI:Triglyceride (mmol/L), (ln + 8.34e-08, 2.38e- 4399, 1)TGP:Serumtriglycerides(mg/dL), 07, 2.55e-08, 0.12241, 26.2642, 4399, TGP:Serumtriglycerides(mg/dL), (ln + 2.4e-07, 8.34e- 0.07911, 0.29648, 4399, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 08, 2.38e-07, 0.12241, 26.2642, 4399, combined TGPSI:Serumtriglycerides:SI(mmol/L) 2.55e-08, 2.4e-07 0.07911, 0.29648 4399 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol (mg/dL), (ln + 1)HDPSI:Serum HDL cholesterol: SI 0.000732781, 0.04743(0.01403), 2426, (mmol/L), HDPSI:Serum HDL cholesterol: 0.00142037, 2.41331(0.75554), 2426, SI (mmol/L), (ln + 1)HDP:HDL- 0.000860652, 0.0266(0.00797), 2426, Cholesterol(mg/dL), HDP:HDL- 0.001425767, 0.0624(0.01954), 2426, Cholesterol(mg/dL), (ln + 0.000732781, 0.04743(0.01403), 2426, 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 0.00142037, 2.41331(0.75554), 2426, L), 0.000860652, 0.0266(0.00797), 2426, NHANES III HDPSI:SerumHDLcholesterol:SI(mmol/L) 0.001425767 0.0624(0.01954) 2426 (ln + 1)LBDHDL:Serum HDL-Cholesterol (mg/dL), LBDHDL:Serum HDL- 0.0023089, 0.03336(0.01094), 3963, LPL - HDL cholesterol, 20686565, rs12678919 HDL Cholesterol NHW Cholesterol (mg/dL), (ln + 0.0090238, 1.5749(0.60285), 3963, RPL30P9 Triglycerides 19060906 1)LBDHDLSI:HDL-cholesterol (mmol/L), 0.00421098, 0.01816(0.00634), 3963, NHANES 9902 LBDHDLSI:HDL-cholesterol (mmol/L) 0.00965209 0.04035(0.01558) 3963 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol 
(mg/dL), (ln + 1)HDPSI:Serum HDL cholesterol: SI 6389, (mmol/L), HDPSI:Serum HDL cholesterol: 6389, SI (mmol/L), (ln + 1)HDP:HDL- 7.45e-06, 6.07e- 6389, Cholesterol(mg/dL), HDP:HDL- 05, 1.73e-05, 6389, Cholesterol(mg/dL), (ln + 6.58e-05, 7.45e- 0.03875, 1.89498, 6389, 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 06, 6.07e-05, 0.02138, 0.04876, 6389, NHANES L), 1.73e-05, 6.58e- 0.03875, 1.89498, 6389, combined HDPSI:SerumHDLcholesterol:SI(mmol/L) 05 0.02138, 0.04876 6389 (ln + 1)TGP:Triglyceride (mg/dL), 0.000709204, -0.09236(0.02724), 2441, TGP:Triglyceride (mg/dL), (ln + 0.001191365, -18.3206(5.6462), - 2441, 1)TGPSI:Triglyceride (mmol/L), 0.000479255, 0.05892(0.01685), 2441, TGPSI:Triglyceride (mmol/L), (ln + 0.001184798, -0.20695(0.06375), 2441, 1)TGP:Serumtriglycerides(mg/dL), 0.000709204, -0.09236(0.02724), 2441, TGP:Serumtriglycerides(mg/dL), (ln + 0.001191365, -18.3206(5.6462), - 2441, 1)TGPSI:Serumtriglycerides:SI(mmol/L), 0.000479255, 0.05892(0.01685), 2441, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.001184798 -0.20695(0.06375) 2441 (ln + 1)LBDSTRSI:Triglycerides 0.0011416, -0.04206(0.01292), 3962, LPL - HDL cholesterol, 20686565, (mmol/L), LBDSTRSI:Triglycerides 0.00273167, -0.14801(0.04936), 3962, rs12678919 Triglycerides NHW RPL30P9 Triglycerides 19060906 (mmol/L), (ln + 1)LBXSTR:Triglycerides 0.00185068, -0.06519(0.02092), 3962, NHANES 9902 (mg/dL), LBXSTR:Triglycerides (mg/dL) 0.00272934 -13.111(4.37241) 3962 (ln + 1)TGP:Triglyceride (mg/dL), 0.000186266, 4396, TGP:Triglyceride (mg/dL), (ln + 0.000252705, 4396, 1)TGPSI:Triglyceride (mmol/L), 9.08e-05, 4396, TGPSI:Triglyceride (mmol/L), (ln + 0.000252148, 4396, 1)TGP:Serumtriglycerides(mg/dL), 0.000186266, -0.07492, -16.3409, 4396, TGP:Serumtriglycerides(mg/dL), (ln + 0.000252705, -0.04881, -0.18453, 4396, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 9.08e-05, -0.07492, -16.3409, 4396, combined TGPSI:Serumtriglycerides:SI(mmol/L) 0.000252148 -0.04881, -0.18453 4396

154 (ln + 1)LCP:Serum LDL cholesterol (mg/dL), LCP:Serum LDL cholesterol (mg/dL), (ln + 1)LCPSI:Serum LDL 0.001379769, -0.05827(0.01815), cholesterol: SI (mmol/L), LCPSI:Serum 0.000589516, -7.11951(2.06368), LDL cholesterol: SI (mmol/L), (ln + 0.000998652, -0.04399(0.01332), 1)LCP:SerumLDLcholesterol(mg/dL), 0.000585706, -0.18419(0.05336), LCP:SerumLDLcholesterol(mg/dL), (ln + 0.001379769, -0.05827(0.01815), 819, 819, 1)LCPSI:SerumLDLcholesterol:SI(mmol/L 0.000589516, -7.11951(2.06368), 819, 819, ), 0.000998652, -0.04399(0.01332), 819, 819, NHANES III LCPSI:SerumLDLcholesterol:SI(mmol/L) 0.000585706 -0.18419(0.05336) 819, 819 (ln + 1)LBDLDL:LDL-cholesterol (mg/dL), LBDLDL:LDL-cholesterol -0.07624(0.01856), LDL cholesterol, 19060906, CELSR2 rs12740374 LDL Cholesterol MA (mg/dL), (ln + 1)LBDLDLSI:LDL- 4.39e-05, 5.68e- -8.69579(2.14901), Coronary heart disease 21347282 cholesterol (mmol/L), LBDLDLSI:LDL- 05, 4.15e-05, -0.0563(0.01366), - 841, 841, NHANES 9902 cholesterol (mmol/L) 5.72e-05 0.22483(0.05558) 841, 841 (ln + 1)LCP:Serum LDL cholesterol (mg/dL), LCP:Serum LDL cholesterol (mg/dL), (ln + 1)LCPSI:Serum LDL 1660, cholesterol: SI (mmol/L), LCPSI:Serum 1660, LDL cholesterol: SI (mmol/L), (ln + 1660, 1)LCP:SerumLDLcholesterol(mg/dL), 1.06e-07, 5.2e- 1660, LCP:SerumLDLcholesterol(mg/dL), (ln + 08, 6.9e-08, -0.06922, -8.12968, 1660, 1)LCPSI:SerumLDLcholesterol:SI(mmol/L 5.19e-08, 1.06e- -0.05162, -0.21026, 1660, NHANES ), 07, 5.2e-08, 6.9e- -0.06922, -8.12968, 1660, combined LCPSI:SerumLDLcholesterol:SI(mmol/L) 08, 5.19e-08 -0.05162, -0.21026 1660 (ln + 1)LCP:Serum LDL cholesterol (mg/dL), LCP:Serum LDL cholesterol (mg/dL), (ln + 1)LCPSI:Serum LDL 0.002563216, -0.06033(0.01994), cholesterol: SI (mmol/L), LCPSI:Serum 0.004787301, -6.61399(2.33794), LDL cholesterol: SI (mmol/L), (ln + 0.003244466, -0.04306(0.01458), 1)LCP:SerumLDLcholesterol(mg/dL), 0.004751133, -0.17119(0.06046), LCP:SerumLDLcholesterol(mg/dL), (ln + 0.002563216, -0.06033(0.01994), 797, 797, 
1)LCPSI:SerumLDLcholesterol:SI(mmol/L 0.004787301, -6.61399(2.33794), 797, 797, ), 0.003244466, -0.04306(0.01458), 797, 797, NHANES III LCPSI:SerumLDLcholesterol:SI(mmol/L) 0.004751133 -0.17119(0.06046) 797, 797 (ln + 1)LBDLDL:LDL-cholesterol (mg/dL), LBDLDL:LDL-cholesterol 0.00161751, -0.06857(0.02164), LDL cholesterol, 19060906, CELSR2 rs12740374 LDL Cholesterol NHB (mg/dL), (ln + 1)LBDLDLSI:LDL- 0.00272804, -7.63866(2.53752), Coronary heart disease 21347282 cholesterol (mmol/L), LBDLDLSI:LDL- 0.0016795, -0.05035(0.01595), 560, 560, NHANES 9902 cholesterol (mmol/L) 0.00273826 -0.19746(0.06562) 560, 560 (ln + 1)LCP:Serum LDL cholesterol (mg/dL), LCP:Serum LDL cholesterol (mg/dL), (ln + 1)LCPSI:Serum LDL 1357, cholesterol: SI (mmol/L), LCPSI:Serum 1357, LDL cholesterol: SI (mmol/L), (ln + 1.39e-05, 4.28e- 1357, 1)LCP:SerumLDLcholesterol(mg/dL), 05, 1.83e-05, 1357, LCP:SerumLDLcholesterol(mg/dL), (ln + 4.26e-05, 1.39e- -0.06417, -7.07872, 1357, 1)LCPSI:SerumLDLcholesterol:SI(mmol/L 05, 4.28e-05, -0.0464, -0.18312, 1357, NHANES ), 1.83e-05, 4.26e- -0.06417, -7.07872, 1357, combined LCPSI:SerumLDLcholesterol:SI(mmol/L) 05 -0.0464, -0.18312 1357 (ln + 1)TGP:Triglyceride (mg/dL), (ln + 0.000884007, -0.05641(0.01695), 2441, Resting heart rate, HDL 20639392, 1)TGPSI:Triglyceride (mmol/L), (ln + 0.001490867, -0.03335(0.01049), 2441, cholesterol, Metabolic 21886157, 1)TGP:Serumtriglycerides(mg/dL), (ln + 0.000884007, -0.05641(0.01695), 2441, traits, Phospholipid 21829377, NHANES III 1)TGPSI:Serumtriglycerides:SI(mmol/L) 0.001490867 -0.03335(0.01049) 2441 FADS1 RS174547 levels (plasma), Triglycerides NHW 20037589, Metabolite levels, (ln + 1)LBDTRSI:Triglyceride (mmol/L), 0.0011003, -0.03799(0.01162), 1951, 19060906, LBDTRSI:Triglyceride (mmol/L), (ln + 0.00504675, -0.14224(0.05067), 1951, Triglycerides, Lipid 22286219 metabolism phenotypes 1)LBXTR:Triglyceride (mg/dL), 0.00177283, -0.05799(0.01853), 1951, NHANES 9902 LBXTR:Triglyceride (mg/dL) 0.00500698 -12.6093(4.48765) 
1951

155 (ln + 1)TGP:Triglyceride (mg/dL), 6.76e-06, 4392, TGP:Triglyceride (mg/dL), (ln + 0.000207057, 4392, 1)TGPSI:Triglyceride (mmol/L), 7.17e-06, 4392, TGPSI:Triglyceride (mmol/L), (ln + 0.000209316, 4392, 1)TGP:Serumtriglycerides(mg/dL), 6.76e-06, -0.05643, -10.3688, 4392, TGP:Serumtriglycerides(mg/dL), (ln + 0.000207057, -0.03503, -0.11698, 4392, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 7.17e-06, -0.05643, -10.3688, 4392, combined TGPSI:Serumtriglycerides:SI(mmol/L) 0.000209316 -0.03503, -0.11698 4392 -0.03456(0.00913), -0.20239(0.05638), 0.000158967, -0.04258(0.01122), 2022, (ln + 1)UAP:Uric acid (mg/dL), UAP:Uric 0.000339324, - 2022, acid (mg/dL), (ln + 1)UAPSI:Uric acid 0.00015269, 12.03717(3.35377), 2022, (umol/L), UAPSI:Uric acid (umol/L), (ln + 0.000339669, -0.03456(0.00913), 2022, 1)UAP:Serumuricacid(mg/dL), 0.000158967, -0.20239(0.05638), 2022, UAP:Serumuricacid(mg/dL), (ln + 0.000339324, -0.04258(0.01122), 2022, 1)UAPSI:Serumuricacid:SI(umol/L), 0.00015269, - 2022, NHANES III UAPSI:Serumuricacid:SI(umol/L) 0.000339669 12.03717(3.35377) 2022 22229870, Uric acid levels, Urate (ln + 1)LBDSUASI:Uric acid (umol/L), 0.000669595, -0.03911(0.01148), 1870, ABCG2 RS2231142 19503597, Uric acid MA levels LBDSUASI:Uric acid (umol/L), (ln + 0.000160101, -13.0228(3.44284), 1870, 18834626 1)LBXSUA:Uric acid (mg/dL), 0.000475213, -0.03316(0.00947), 1870, NHANES 9902 LBXSUA:Uric acid (mg/dL) 0.000160387 -0.21892(0.05788) 1870 3892, (ln + 1)UAP:Uric acid (mg/dL), UAP:Uric 3892, acid (mg/dL), (ln + 1)UAPSI:Uric acid 3.52e-07, 2.56e- 3892, (umol/L), UAPSI:Uric acid (umol/L), (ln + 07, 4.72e-07, 3892, 1)UAP:Serumuricacid(mg/dL), 2.56e-07, 3.52e- -0.03356, -0.20861, 3892, UAP:Serumuricacid(mg/dL), (ln + 07, 2.56e-07, -0.04051, -12.4084, 3892, NHANES 1)UAPSI:Serumuricacid:SI(umol/L), 4.72e-07, 2.56e- -0.03356, -0.20861, 3892, combined UAPSI:Serumuricacid:SI(umol/L) 07 -0.04051, -12.4084 3892

UAPSI:Uric acid (umol/L) 3.72E-06 -17.06285, 2579 UAPSI:Uric acid (umol/L) 3.72E-06 -17.06285, 2579 UAPSI:Serumuricacid:SI(umol/L) 3.72E-06 -17.06285, 2579 UAP:Uric acid (mg/dL) 3.73E-06 -0.28685, 2579 UAP:Uric acid (mg/dL) 3.73E-06 -0.28685, 2579 UAP:Serumuricacid(mg/dL) 3.73E-06 -0.28685, 2579 (ln+1)UAP:Uric acid (mg/dL) 1.49E-05 -0.04223, 2579 (ln+1)UAP:Uric acid (mg/dL) 1.49E-05 -0.04223, 2579 (ln+1)UAP:Serumuricacid(mg/dL) 1.49E-05 -0.04223, 2579 (ln+1)UAPSI:Uric acid (umol/L) 1.85E-05 -0.05019, 2579 (ln+1)UAPSI:Uric acid (umol/L) 1.85E-05 -0.05019, 2579 NHANES III (ln+1)UAPSI:Serumuricacid:SI(umol/L) 1.85E-05 -0.05019 2579 22229870, Uric acid levels, Urate ABCG2 RS2231142 19503597, Uric acid NHW levels LBDSUASI:Uric acid (umol/L) 4.60E-09 3954 18834626 LBXSUA:Uric acid (mg/dL) 4.60E-09 3954 (ln+1)LBXSUA:Uric acid (mg/dL) 5.03E-08 3954 (ln+1)LBDSUASI:Uric acid (umol/L) 8.48E-08 4.25E- 3954 (ln+1) UAP:Uric acid (mg/dL) 12 4.25E-12 6533 (ln+1)UAP:Uric acid (mg/dL) 4.25E-12 4.25E-12 6533 (ln+1)UAP:Serumuricacid(mg/dL) 4.25E-12 4.25E-12 6533 UAP:Uric acid (mg/dL) 1.10E-13 1.10E-13 6533 UAP:Uric acid (mg/dL) 1.10E-13 1.10E-13 6533 UAP:Serumuricacid(mg/dL) 1.10E-13 1.10E-13 6533 (ln+1)UAPSI:Uric acid (umol/L) 8.36E-12 8.36E-12 6533 (ln+1)UAPSI:Uric acid (umol/L) 8.36E-12 8.36E-12 6533 (ln+1)UAPSI:Serumuricacid:SI(umol/L) 8.36E-12 8.36E-12 6533 UAPSI:Uric acid (umol/L) 1.10E-13 1.10E-13 6533 UAPSI:Uric acid (umol/L) 1.10E-13 1.10E-13 6533 NHANES 9902 UAPSI:Serumuricacid:SI(umol/L) 1.10E-13 1.10E-13 6533

156 -0.10278(0.02588), - (ln + 1)TGP:Triglyceride (mg/dL), 7.39e-05, 15.30157(5.34718), 2046, TGP:Triglyceride (mg/dL), (ln + 0.004257769, -0.06154(0.01623), 2046, 1)TGPSI:Triglyceride (mmol/L), 0.000153666, -0.17284(0.06037), 2046, TGPSI:Triglyceride (mmol/L), (ln + 0.004239193, -0.10278(0.02588), 2046, 1)TGP:Serumtriglycerides(mg/dL), 7.39e-05, - 2046, TGP:Serumtriglycerides(mg/dL), (ln + 0.004257769, 15.30157(5.34718), 2046, 1)TGPSI:Serumtriglycerides:SI(mmol/L), 0.000153666, -0.06154(0.01623), 2046, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.004239193 -0.17284(0.06037) 2046 (ln + 1)LBDSTRSI:Triglycerides -0.08399(0.01768), (mmol/L), LBDSTRSI:Triglycerides -0.40039(0.07619), RS2892768 (mmol/L), (ln + 1)LBDTRSI:Triglyceride 2.18e-06, 1.65e- -0.11016(0.02378), 1856, BUD13 Triglycerides 18193044 Triglycerides MA 0 (mmol/L), LBDTRSI:Triglyceride 07, 4.13e-06, -0.49245(0.09858), 1856, (mmol/L), (ln + 1)LBXSTR:Triglycerides 7.01e-07, 5.38e- -0.12735(0.02791), 936, 936, (mg/dL), LBXSTR:Triglycerides (mg/dL), 06, 1.65e-07, -35.4655(6.74815), 1856, (ln + 1)LBXTR:Triglyceride (mg/dL), 1.39e-05, 7.12e- -0.16225(0.03714), 1856, NHANES 9902 LBXTR:Triglyceride (mg/dL) 07 -43.5919(8.7321) 936, 936 (ln + 1)TGP:Triglyceride (mg/dL), 2982, TGP:Triglyceride (mg/dL), (ln + 2982, 1)TGPSI:Triglyceride (mmol/L), 2982, TGPSI:Triglyceride (mmol/L), (ln + 8.1e-09, 1.11e- 2982, 1)TGP:Serumtriglycerides(mg/dL), 07, 7.8e-09, -0.12322, -24.4207, 2982, TGP:Serumtriglycerides(mg/dL), (ln + 1.09e-07, 8.1e- -0.07781, -0.27586, 2982, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 09, 1.11e-07, -0.12322, -24.4207, 2982, combined TGPSI:Serumtriglycerides:SI(mmol/L) 7.8e-09, 1.09e-07 -0.07781, -0.27586 2982 -0.13631(0.03008), - (ln + 1)TGP:Triglyceride (mg/dL), 29.28428(6.26589), 2596, TGP:Triglyceride (mg/dL), (ln + -0.08708(0.01862), 2596, 1)TGPSI:Triglyceride (mmol/L), -0.33067(0.07074), 2596, TGPSI:Triglyceride (mmol/L), (ln + 6.1e-06, 3.11e- -0.13631(0.03008), 2596, 
1)TGP:Serumtriglycerides(mg/dL), 06, 3.05e-06, - 2596, TGP:Serumtriglycerides(mg/dL), (ln + 3.1e-06, 6.1e-06, 29.28428(6.26589), 2596, 1)TGPSI:Serumtriglycerides:SI(mmol/L), 3.11e-06, 3.05e- -0.08708(0.01862), 2596, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 06, 3.1e-06 -0.33067(0.07074) 2596 (ln + 1)LBDSTRSI:Triglycerides -0.0807(0.01536), - 3957, (mmol/L), LBDSTRSI:Triglycerides 0.2832(0.05871), - 3957, RS2892768 (mmol/L), (ln + 1)LBDTRSI:Triglyceride 1.56e-07, 1.46e- 0.08246(0.02229), 1957, BUD13 Triglycerides 18193044 Triglycerides NHW 0 (mmol/L), LBDTRSI:Triglyceride 06, 0.000222531, -0.29518(0.09722), 1957, (mmol/L), (ln + 1)LBXSTR:Triglycerides 0.00242644, -0.1251(0.02487), - 3957, (mg/dL), LBXSTR:Triglycerides (mg/dL), 5.14e-07, 1.46e- 25.0837(5.20008), 3957, (ln + 1)LBXTR:Triglyceride (mg/dL), 06, 0.000357868, -0.12713(0.03556), 1957, NHANES 9902 LBXTR:Triglyceride (mg/dL) 0.00242071 -26.149(8.61004) 1957 (ln + 1)TGP:Triglyceride (mg/dL), 4553, TGP:Triglyceride (mg/dL), (ln + 4553, 1)TGPSI:Triglyceride (mmol/L), 4553, TGPSI:Triglyceride (mmol/L), (ln + 1.02e-08, 5.34e- 4553, 1)TGP:Serumtriglycerides(mg/dL), 08, 3.2e-09, -0.13188, -27.917, 4553, TGP:Serumtriglycerides(mg/dL), (ln + 5.35e-08, 1.02e- -0.08481, -0.31519, 4553, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 08, 5.34e-08, -0.13188, -27.917, 4553, combined TGPSI:Serumtriglycerides:SI(mmol/L) 3.2e-09, 5.35e-08 -0.08481, -0.31519 4553 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol (mg/dL), (ln + 0.002320326, 0.05049(0.01656), 1991, 1)HDPSI:Serum HDL cholesterol: SI 0.004335424, 2.38406(0.83477), 1991, 18193044, (mmol/L), HDPSI:Serum HDL cholesterol: 0.002979971, 0.0275(0.00925), 1991, HDL cholesterol, LPL RS328 17463246, HDL Cholesterol MA SI (mmol/L), (ln + 1)HDP:HDL- 0.004338748, 0.06165(0.02159), 1991, Triglycerides 22171074 Cholesterol(mg/dL), HDP:HDL- 0.002320326, 0.05049(0.01656), 1991, Cholesterol(mg/dL), (ln + 0.004335424, 2.38406(0.83477), 1991, 
1)HDPSI:SerumHDLcholesterol:SI(mmol/ 0.002979971, 0.0275(0.00925), 1991, NHANES III L), 0.004338748 0.06165(0.02159) 1991

HDPSI:SerumHDLcholesterol:SI(mmol/L)

(ln + 1)LBDHDL:Serum HDL-Cholesterol (mg/dL), LBDHDL:Serum HDL- 0.00110028, 0.05921(0.01811), 1842, Cholesterol (mg/dL), (ln + 0.00217516, 2.90587(0.94668), 1842, 1)LBDHDLSI:HDL-cholesterol (mmol/L), 0.00118488, 0.03344(0.0103), 1842, NHANES 9902 LBDHDLSI:HDL-cholesterol (mmol/L) 0.00215124 0.07519(0.02447) 1842 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol (mg/dL), (ln + 1)HDPSI:Serum HDL cholesterol: SI 3833, (mmol/L), HDPSI:Serum HDL cholesterol: 3833, SI (mmol/L), (ln + 1)HDP:HDL- 3833, Cholesterol(mg/dL), HDP:HDL- 1.25e-05, 4.21e- 3833, Cholesterol(mg/dL), (ln + 05, 1.7e-05, 0.05353, 2.57485, 3833, 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 4.16e-05, 1.25e- 0.02969, 0.06661, 3833, NHANES L), 05, 4.21e-05, 0.05353, 2.57485, 3833, combined HDPSI:SerumHDLcholesterol:SI(mmol/L) 1.7e-05, 4.16e-05 0.02969, 0.06661 3833 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol (mg/dL), (ln + 1)HDPSI:Serum HDL cholesterol: SI 0.000187986, 0.05083(0.01359), 2539, (mmol/L), HDPSI:Serum HDL cholesterol: 0.000344989, 2.62653(0.73289), 2539, SI (mmol/L), (ln + 1)HDP:HDL- 0.000217098, 0.02864(0.00773), 2539, Cholesterol(mg/dL), HDP:HDL- 0.000346222, 0.06792(0.01896), 2539, Cholesterol(mg/dL), (ln + 0.000187986, 0.05083(0.01359), 2539, 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 0.000344989, 2.62653(0.73289), 2539, L), 0.000217098, 0.02864(0.00773), 2539, NHANES III HDPSI:SerumHDLcholesterol:SI(mmol/L) 0.000346222 0.06792(0.01896) 2539 (ln + 1)LBDHDL:Serum HDL-Cholesterol 18193044, (mg/dL), LBDHDL:Serum HDL- 0.00137361, 0.03468(0.01083), 3936, HDL cholesterol, LPL RS328 17463246, HDL Cholesterol NHW Cholesterol (mg/dL), (ln + 0.0064059, 1.62794(0.59682), 3936, Triglycerides 22171074 1)LBDHDLSI:HDL-cholesterol (mmol/L), 0.00271434, 0.01883(0.00628), 3936, NHANES 9902 LBDHDLSI:HDL-cholesterol (mmol/L) 0.00681392 0.04177(0.01543) 3936 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol (mg/dL), (ln + 1)HDPSI:Serum HDL cholesterol: SI 6475, (mmol/L), HDPSI:Serum HDL 
cholesterol: 6475, SI (mmol/L), (ln + 1)HDP:HDL- 1.19e-06, 1.21e- 6475, Cholesterol(mg/dL), HDP:HDL- 05, 3.06e-06, 6475, Cholesterol(mg/dL), (ln + 1.31e-05, 1.19e- 0.04123, 2.02989, 6475, 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 06, 1.21e-05, 0.0228, 0.05228, 6475, NHANES L), 3.06e-06, 1.31e- 0.04123, 2.02989, 6475, combined HDPSI:SerumHDLcholesterol:SI(mmol/L) 05 0.0228, 0.05228 6475 -0.11608(0.02643), - (ln + 1)TGP:Triglyceride (mg/dL), 1.17e-05, 21.53021(5.52796), 2555, TGP:Triglyceride (mg/dL), (ln + 0.00010081, -0.07197(0.01637), 2555, 1)TGPSI:Triglyceride (mmol/L), 1.14e-05, -0.24315(0.06241), 2555, TGPSI:Triglyceride (mmol/L), (ln + 0.000100372, -0.11608(0.02643), 2555, 18193044, HDL cholesterol, 1)TGP:Serumtriglycerides(mg/dL), 1.17e-05, - 2555, LPL RS328 17463246, Triglycerides NHW Triglycerides TGP:Serumtriglycerides(mg/dL), (ln + 0.00010081, 21.53021(5.52796), 2555, 22171074 1)TGPSI:Serumtriglycerides:SI(mmol/L), 1.14e-05, -0.07197(0.01637), 2555, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.000100372 -0.24315(0.06241) 2555 (ln + 1)LBDSTRSI:Triglycerides -0.05962(0.01279), 3935, (mmol/L), LBDSTRSI:Triglycerides 3.24e-06, 1.35e- -0.21328(0.04894), 3935, (mmol/L), (ln + 1)LBDTRSI:Triglyceride 05, 0.00331641, -0.053(0.01802), - 1941, NHANES 9902 (mmol/L), LBXSTR:Triglycerides (mg/dL) 1.34e-05 18.8915(4.33468) 3935

158 (ln + 1)TGP:Triglyceride (mg/dL), 4496, TGP:Triglyceride (mg/dL), (ln + 4496, 1)TGPSI:Triglyceride (mmol/L), 2.76e-07, 5.69e- 4496, TGPSI:Triglyceride (mmol/L), (ln + 06, 1.99e-07, 4496, 1)TGP:Serumtriglycerides(mg/dL), 5.67e-06, 2.76e- -0.10035, -19.7943, 4496, TGP:Serumtriglycerides(mg/dL), (ln + 07, 5.69e-06, -0.0632, -0.22353, 4496, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 1.99e-07, 5.67e- -0.10035, -19.7943, 4496, combined TGPSI:Serumtriglycerides:SI(mmol/L) 06 -0.0632, -0.22353 4496 (ln + 1)HDP:HDL-Cholesterol (mg/dL), HDP:HDL-Cholesterol (mg/dL), (ln + 1)HDPSI:Serum HDL cholesterol: SI 0.04004(0.00947), 2019, (mmol/L), HDPSI:Serum HDL cholesterol: 2.34946(0.54907), 2019, SI (mmol/L), (ln + 1)HDP:HDL- 2.45e-05, 1.97e- 0.02366(0.00555), 2019, Cholesterol(mg/dL), HDP:HDL- 05, 2.13e-05, 0.06067(0.0142), 2019, Cholesterol(mg/dL), (ln + 2.02e-05, 2.45e- 0.04004(0.00947), 2019, Age-related macular 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 05, 1.97e-05, 2.34946(0.54907), 2019, degeneration, 20385819, L), 2.13e-05, 2.02e- 0.02366(0.00555), 2019, Cholesterol, total, LDL 21665990, NHANES III HDPSI:SerumHDLcholesterol:SI(mmol/L) 05 0.06067(0.0142) 2019 cholesterol, Metabolic 19060910, (ln + 1)LBDHDL:Serum HDL-Cholesterol syndrome, Lipid 18193043, (mg/dL), LBDHDL:Serum HDL- 0.00180704, 0.03842(0.01229), 1324, HERPUD metabolism phenotypes, 20694148, RS3764261 HDL Cholesterol NHB Cholesterol (mg/dL), (ln + 0.00224556, 2.16996(0.70875), 1324, 1 - CETP Metabolic syndrome 22286219, 1)LBDHDLSI:HDL-cholesterol (mmol/L), 0.00215103, 0.02218(0.00721), 1324, (bivariate traits), Waist 19359809, NHANES 9902 LBDHDLSI:HDL-cholesterol (mmol/L) 0.0024551 0.05559(0.01832) 1324 circumference and 21386085, related phenotypes, HDL 18454146, (ln + 1)HDP:HDL-Cholesterol (mg/dL), cholesterol, 20686565 HDP:HDL-Cholesterol (mg/dL), (ln + Triglycerides 1)HDPSI:Serum HDL cholesterol: SI 3343, (mmol/L), HDPSI:Serum HDL cholesterol: 3343, SI (mmol/L), (ln + 1)HDP:HDL- 1.65e-07, 1.67e- 3343, 
Cholesterol(mg/dL), HDP:HDL- 07, 1.73e-07, 3343, Cholesterol(mg/dL), (ln + 1.88e-07, 1.65e- 0.03936, 2.27671, 3343, 1)HDPSI:SerumHDLcholesterol:SI(mmol/ 07, 1.67e-07, 0.02306, 0.05863, 3343, NHANES L), 1.73e-07, 1.88e- 0.03936, 2.27671, 3343, combined HDPSI:SerumHDLcholesterol:SI(mmol/L) 07 0.02306, 0.05863 3343 (ln+1)HDP:HDL-Cholesterol (mg/dL) (ln+1)HDP:HDL-cholesterol (mg/dL) (ln+1)HDP:HDL-Cholesterol(mg/dL) (ln+1)HDPSI:Serum HDL cholesterol: SI 1.70E-09 0.05427 1822 Age-related macular (mmol/L) 1.70E-09 0.05427 1822 degeneration, 20385819, (ln+1)HDPSI:SerumHDLcholesterol:SI(m 1.70E-09 0.05427 1822 Cholesterol, total, LDL 21665990, mol/L) (ln+1)HDP:HDL-Cholesterol 3.30E-09 0.02966 1822 cholesterol, Metabolic 19060910, (mg/dL) 3.30E-09 0.02966 1822 syndrome, Lipid 18193043, HDP:HDL-cholesterol (mg/dL) 1.11E-08 2.57178 1822 HERPUD metabolism phenotypes, 20694148, HDP:HDL-Cholesterol(mg/dL) 1.11E-08 2.57178 1822 RS3764261 HDL Cholesterol MA 1 - CETP Metabolic syndrome 22286219, HDPSI:Serum HDL cholesterol: SI 1.11E-08 2.57178 1822 (bivariate traits), Waist 19359809, (mmol/L) 1.22E-08 0.06631 1822 circumference and 21386085, NHANES III HDPSI:SerumHDLcholesterol:SI(mmol/L) 1.22E-08 0.06631 1822 related phenotypes, HDL 18454146, (ln+1)LBDHDL:Serum HDL-Cholesterol cholesterol, 20686565 (mg/dL) Triglycerides (ln+1)LBDHDLSI:HDL-cholesterol (mmol/L) 1.10E-09 0.0569 1857 LBDHDLSI:HDL-cholesterol (mmol/L) 1.90E-09 0.0319 1857 LBDHDL:Serum HDL-Cholesterol 9.10E-09 0.07255 1857 NHANES 9902 (mg/dL) 9.30E-09 2.80581 1857

159 (ln+1)HDP:HDL-Cholesterol (mg/dL) (ln+1)HDP:HDL-cholesterol (mg/dL) (ln+1)HDP:HDL-Cholesterol(mg/dL) HDP:HDL-Cholesterol (mg/dL) 2.11E-17 0.05483 3679 HDP:HDL-cholesterol (mg/dL) 2.11E-17 0.05483 3679 HDP:HDL-Cholesterol(mg/dL) 2.11E-17 0.05483 3679 (ln+1)HDPSI:Serum HDL cholesterol: SI 1.13E-15 2.64898 3679 (mmol/L) 1.13E-15 2.64898 3679 (ln+1)HDPSI:SerumHDLcholesterol:SI(m 1.13E-15 2.64898 3679 mol/L) 6.37E-17 0.03034 3679 HDPSI:Serum HDL cholesterol: SI 6.37E-17 0.03034 3679 NHANES (mmol/L) 1.24E-15 0.0684 3679 combined HDPSI:SerumHDLcholesterol:SI(mmol/L) 1.24E-15 0.0684 3679 (ln+1)HDP:HDL-Cholesterol (mg/dL) (ln+1)HDP:HDL-cholesterol (mg/dL) (ln+1)HDP:HDL-Cholesterol(mg/dL) (ln+1)HDPSI:Serum HDL cholesterol: SI 3.02E-13 0.06433 2429 (mmol/L) 3.02E-13 0.06433 2429 (ln+1)HDPSI:SerumHDLcholesterol:SI(m 3.02E-13 0.06433 2429 mol/L) 1.04E-12 0.03577 2429 (ln+1)HDP:HDL-Cholesterol (mg/dL) 1.04E-12 0.03577 2429 )HDP:HDL-cholesterol (mg/dL) 2.29E-11 3.18163 2429 HDP:HDL-Cholesterol(mg/dL) 2.29E-11 3.18163 2429 HDPSI:Serum HDL cholesterol: SI 2.29E-11 3.18163 2429 Age-related macular (mmol/L) 2.35E-11 0.08226 2429 degeneration, 20385819, NHANES III HDPSI:SerumHDLcholesterol:SI(mmol/L) 2.35E-11 0.08226 2429 Cholesterol, total, LDL 21665990, (ln+1)LBDHDL:Serum HDL-Cholesterol cholesterol, Metabolic 19060910, (mg/dL) syndrome, Lipid 18193043, LBDHDL:Serum HDL-Cholesterol HERPUD metabolism phenotypes, 20694148, RS3764261 HDL Cholesterol NHW (mg/dL) 1.34E-22 0.06828 3934 1 - CETP Metabolic syndrome 22286219, (ln+1)LBDHDLSI:HDL-cholesterol 2.46E-21 3.64456 3934 (bivariate traits), Waist 19359809, (mmol/L) 2.72E-22 0.03923 3934 circumference and 21386085, NHANES 9902 LBDHDLSI:HDL-cholesterol (mmol/L) 1.66E-21 0.09458 3934 related phenotypes, HDL 18454146, cholesterol, 20686565 (ln+1)HDP:HDL-Cholesterol (mg/dL) Triglycerides (ln+1)HDP:HDL-cholesterol (mg/dL) (ln+1)HDP:HDL-Cholesterol(mg/dL) HDP:HDL-Cholesterol (mg/dL) 1.28E-34 0.06725 6363 HDP:HDL-cholesterol (mg/dL) 
1.28E-34 0.06725 6363 HDP:HDL-Cholesterol(mg/dL) 1.28E-34 0.06725 6363 (ln+1)HDPSI:Serum HDL cholesterol: SI 1.82E-31 3.49538 6363 (mmol/L) 1.82E-31 3.49538 6363 (ln+1)HDPSI:SerumHDLcholesterol:SI(m 1.82E-31 3.49538 6363 mol/L) 7.90E-34 0.03819 6363 HDPSI:Serum HDL cholesterol: SI 7.90E-34 0.03819 6363 NHANES (mmol/L) 1.21E-31 0.09059 6363 combined HDPSI:SerumHDLcholesterol:SI(mmol/L) 1.21E-31 0.09059 6363 17463246, (ln + 1)TCP:Total cholesterol (mg/dL), 0.001691436, 0.02541(0.00808), 2448, Triglycerides, HDL 20864672, TCP:Total cholesterol (mg/dL), (ln + 0.005235327, 4.64918(1.66357), 2448, cholesterol, Alzheimer's 19060906, 1)TCPSI:Total cholesterol (mmol/L), 0.001968928, 0.02099(0.00678), 2448, disease (age of onset), 18262040, TCPSI:Total cholesterol (mmol/L), (ln + 0.006181989, 0.12129(0.04426), 2446, Longevity, Alzheimer's 22005931, 1)CHP:Serumcholesterol(mg/dL), 0.002763581, 0.02437(0.00814), 2446, disease (late onset), 20442857, CHP:Serumcholesterol(mg/dL), (ln + 0.006185997, 4.69003(1.71161), 2446, Cholesterol, total, 18193043, 1)TCP:Cholesterol, Total(mg/dL), 0.001691436, 0.02541(0.00808), 2448, APOC1 RS4420638 Cognitive decline, LDL 21740922, Cholesterol NHW NHANES III TCP:Cholesterol, Total(mg/dL) 0.005235327 4.64918(1.66357) 2448 cholesterol, Lipoprotein- 17975299, (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 7.52e-05, 8.77e- 0.01969(0.00497), 3965, associated phospholipase 21196492, LBDSCHSI:Cholesterol (mmol/L), (ln + 05, 0.000188556, 0.12309(0.03135), 3965, A2 activity and mass, 17474819, 1)LBDTCSI:Total cholesterol (mmol/L), 0.000232396, 0.0186(0.00498), 3967, Alzheimer's disease, C- 18802019, LBDTCSI:Total cholesterol (mmol/L), (ln 7.52e-05, 8.72e- 0.11741(0.03187), 3967, reactive protein, 22054870, + 1)LBXSCH:Serum Cholesterol (mg/dL), 05, 0.000189088, 0.02351(0.00593), 3965, Quantitative traits 20686565, LBXSCH:Serum Cholesterol (mg/dL), (ln 0.000235977, 4.76078(1.21212), 3965, 18193044, NHANES 9902 + 1)LBXTC:Total cholesterol (mg/dL), 7.52e-05, 
8.72e- 0.02214(0.00592), 3967,

22003152, LBXTC:Total cholesterol (mg/dL), (ln + 05 4.53527(1.23228), 3967, 19567438, 1)LBXSCH:Cholesterol, total (mg/dL), 0.02351(0.00593), 3965, 17998437, LBXSCH:Cholesterol, total (mg/dL) 4.76078(1.21212) 3965 21300955, 19197348

(ln + 1)CHPSI:Serum cholesterol: SI 2446, (mmol/L), CHPSI:Serum cholesterol: SI 0.00311326, 2446, (mmol/L), (ln + 1)TCP:Total cholesterol 0.00618199, 6415, (mg/dL), TCP:Total cholesterol (mg/dL), 1.28e-06, 4.59e- 6415, (ln + 1)TCPSI:Total cholesterol (mmol/L), 06, 1.47e-06, 6415, TCPSI:Total cholesterol (mmol/L), (ln + 4.5e-06, 0.02025, 0.12129, 6415, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.00311326, 0.02325, 4.55306, 2446, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00618199, 0.0194, 0.11785, 2446, NHANES 1)TCP:Cholesterol, Total(mg/dL), 1.28e-06, 4.59e- 0.02025, 0.12129, 6415, combined TCP:Cholesterol, Total(mg/dL) 06 0.02325, 4.55306 6415 (ln + 1)TCP:Total cholesterol (mg/dL), -0.04174(0.00907), 1828, TCP:Total cholesterol (mg/dL), (ln + -8.74544(1.82393), 1828, 1)TCPSI:Total cholesterol (mmol/L), 4.43e-06, 1.76e- -0.03514(0.00754), 1828, TCPSI:Total cholesterol (mmol/L), (ln + 06, 3.39e-06, -0.22624(0.04717), 1828, 1)CHP:Serumcholesterol(mg/dL), 1.74e-06, 1.24e- -0.0395(0.00901), - 1823, CHP:Serumcholesterol(mg/dL), (ln + 05, 3.46e-06, 8.4384(1.81237), - 1823, 1)TCP:Cholesterol, Total(mg/dL), 4.43e-06, 1.76e- 0.04174(0.00907), 1828, NHANES III TCP:Cholesterol, Total(mg/dL) 06 -8.74544(1.82393) 1828 (ln + 1)LBDSCHSI:Cholesterol (mmol/L), -0.03515(0.00727), 1858, LBDSCHSI:Cholesterol (mmol/L), (ln + -0.22185(0.04534), 1858, 1)LBDTCSI:Total cholesterol (mmol/L), -0.03482(0.00722), 1858, Coronary heart disease, 21378988, LBDTCSI:Total cholesterol (mmol/L), (ln 1.45e-06, 1.08e- -0.22036(0.04541), 1858, Response to statin 20339536, + 1)LBXSCH:Serum Cholesterol (mg/dL), 06, 1.53e-06, -0.04188(0.00871), 1858, therapy, Cholesterol, 19060910, LBXSCH:Serum Cholesterol (mg/dL), (ln 1.32e-06, 1.64e- -8.58494(1.75307), 1858, CELSR2 RS646776 total, Progranulin levels, 21087763, Cholesterol MA + 1)LBXTC:Total cholesterol (mg/dL), 06, 1.06e-06, -0.04146(0.00863), 1858, Myocardial infarction 19198609, LBXTC:Total cholesterol (mg/dL), (ln + 1.69e-06, 1.3e- -8.52679(1.75607), 1858, 
(early onset), LDL 19060911, 1)LBXSCH:Cholesterol, total (mg/dL), 06, 1.64e-06, -0.04188(0.00871), 1858, cholesterol 18193044 NHANES 9902 LBXSCH:Cholesterol, total (mg/dL) 1.06e-06 -8.58494(1.75307) 1858

NHANES combined; (ln+1)TCP:Total cholesterol (mg/dL); 7.51E-12; -0.04314; 3686
NHANES combined; (ln+1)TCP:Total cholesterol (mg/dL); 7.51E-12; -0.04314; 3686
NHANES combined; (ln+1)TCP:Cholesterol,Total(mg/dL); 7.51E-12; -0.04314; 3686
NHANES combined; TCP:Total cholesterol (mg/dL); 2.39E-12; -8.9187; 3686
NHANES combined; TCP:Total cholesterol (mg/dL); 2.39E-12; -8.9187; 3686
NHANES combined; TCP:Cholesterol,Total(mg/dL); 2.39E-12; -8.9187; 3686
NHANES combined; (ln+1)TCPSI:Total cholesterol (mmol/L); 4.96E-12; -0.03626; 3686
NHANES combined; (ln+1)TCPSI:Total cholesterol (mmol/L); 4.96E-12; -0.03626; 3686
NHANES combined; TCPSI:Total cholesterol (mmol/L); 2.39E-12; -0.23062; 3686
NHANES combined; TCPSI:Total cholesterol (mmol/L); 2.39E-12; -0.23062; 3686

161 (ln + 1)TCP:Total cholesterol (mg/dL), 0.006577291, -0.01958(0.0072), - 2019, TCP:Total cholesterol (mg/dL), (ln + 0.005559009, 3.99152(1.43802), 2019, 1)TCPSI:Total cholesterol (mmol/L), 0.002068281, -0.01866(0.00605), 2023, TCPSI:Total cholesterol (mmol/L), (ln + 0.002449969, -0.11598(0.03824), 2023, 1)CHP:Serumcholesterol(mg/dL), 0.002000334, -0.02239(0.00724), 2023, CHP:Serumcholesterol(mg/dL), (ln + 0.002448868, -4.4851(1.47857), - 2023, 1)TCP:Cholesterol, Total(mg/dL), 0.006577291, 0.01958(0.0072), - 2019, Coronary heart disease, 21378988, NHANES III TCP:Cholesterol, Total(mg/dL) 0.005559009 3.99152(1.43802) 2019 Response to statin 20339536, (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 0.00232206, -0.02112(0.00692), 1329, therapy, Cholesterol, 19060910, LBDSCHSI:Cholesterol (mmol/L), (ln + 0.00280558, -0.12668(0.04231), 1329, CELSR2 RS646776 total, Progranulin levels, 21087763, Cholesterol NHB 1)LBDTCSI:Total cholesterol (mmol/L), 0.000963743, -0.02271(0.00686), 1330, Myocardial infarction 19198609, LBDTCSI:Total cholesterol (mmol/L), (ln 0.00110796, -0.13808(0.04224), 1330, (early onset), LDL 19060911, + 1)LBXSCH:Serum Cholesterol (mg/dL), 0.002152, -0.02559(0.00832), 1329, cholesterol 18193044 LBXSCH:Serum Cholesterol (mg/dL), (ln 0.00279562, -4.90012(1.63612), 1329, + 1)LBXTC:Total cholesterol (mg/dL), 0.000912026, -0.02737(0.00824), 1330, LBXTC:Total cholesterol (mg/dL), (ln + 0.00110835, -5.33943(1.63353), 1330, 1)LBXSCH:Cholesterol, total (mg/dL), 0.002152, -0.02559(0.00832), 1329, NHANES 9902 LBXSCH:Cholesterol, total (mg/dL) 0.00279562 -4.90012(1.63612) 1329 (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), CHPSI:Serum cholesterol: SI 2023, (mmol/L), (ln + 2023, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.00206828, 2023, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00244997, 2023, 1)TCP:Total cholesterol (mg/dL), 0.00206828, 3349, TCP:Total cholesterol (mg/dL), (ln + 0.00244997, -0.01866, -0.11598, 3349, 1)TCPSI:Total cholesterol (mmol/L), 2.17e-05, 2.16e- 
-0.01866, -0.11598, 3349, TCPSI:Total cholesterol (mmol/L), (ln + 05, 2.14e-05, -0.0232, -4.62045, 3349, NHANES 1)TCP:Cholesterol, Total(mg/dL), 2.15e-05, 2.17e- -0.01935, -0.11951, 3349, combined TCP:Cholesterol, Total(mg/dL) 05, 2.16e-05 -0.0232, -4.62045 3349 (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total cholesterol (mg/dL), (ln + -0.03427(0.00775), 2446, 1)TCPSI:Total cholesterol (mmol/L), -7.03813(1.59343), 2446, TCPSI:Total cholesterol (mmol/L), -0.03054(0.00656), 2444, HAE7:Doctor told you - high cholesterol -0.19805(0.04241), 2444, level, (ln + 1.02e-05, 1.04e- 0.721(0.0935), - 1585, 1)CHP:Serumcholesterol(mg/dL), 05, 3.41e-06, 0.03629(0.0078), - 2444, Coronary heart disease, 21378988, CHP:Serumcholesterol(mg/dL), (ln + 3.18e-06, 0.0005, 7.65901(1.64004), 2444, Response to statin 20339536, 1)TCP:Cholesterol, Total(mg/dL), 3.45e-06, 3.17e- -0.03427(0.00775), 2446, therapy, Cholesterol, 19060910, TCP:Cholesterol, Total(mg/dL), 06, 1.02e-05, -7.03813(1.59343), 2446, CELSR2 RS646776 total, Progranulin levels, 21087763, Cholesterol NHW NHANES III HAE7:Doctortoldbloodcholesterollevelhigh 1.04e-05, 0.0005 0.721(0.0935) 1585 Myocardial infarction 19198609, (ln + 1)LBDSCHSI:Cholesterol (mmol/L), -0.02104(0.00451), 3941, (early onset), LDL 19060911, LBDSCHSI:Cholesterol (mmol/L), (ln + -0.1334(0.0284), - 3941, cholesterol 18193044 1)LBDTCSI:Total cholesterol (mmol/L), 0.02204(0.00451), 3943, LBDTCSI:Total cholesterol (mmol/L), (ln 3.12e-06, 2.74e- -0.14176(0.02883), 3943, + 1)LBXSCH:Serum Cholesterol (mg/dL), 06, 1.07e-06, -0.02503(0.00538), 3941, LBXSCH:Serum Cholesterol (mg/dL), (ln 9.15e-07, 3.39e- -5.16075(1.09826), 3941, + 1)LBXTC:Total cholesterol (mg/dL), 06, 2.7e-06, -0.02613(0.00537), 3943, LBXTC:Total cholesterol (mg/dL), (ln + 1.18e-06, 9.26e- -5.47924(1.1149), - 3943, 1)LBXSCH:Cholesterol, total (mg/dL), 07, 3.39e-06, 0.02503(0.00538), 3941, NHANES 9902 LBXSCH:Cholesterol, total (mg/dL) 2.7e-06 -5.16075(1.09826) 3941

162 (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), CHPSI:Serum cholesterol: SI 2444, (mmol/L), (ln + 2444, 1)CHPSI:Serumcholesterol:SI(mmol/L), 2444, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 2444, 1)TCP:Total cholesterol (mg/dL), 6389, TCP:Total cholesterol (mg/dL), (ln + 3.41e-06, 3.18e- -0.03054, -0.19805, 6389, 1)TCPSI:Total cholesterol (mmol/L), 06, 3.41e-06, -0.03054, -0.19805, 6389, TCPSI:Total cholesterol (mmol/L), (ln + 3.18e-06, 1e-10, -0.02879, -6.00312, 6389, NHANES 1)TCP:Cholesterol, Total(mg/dL), 1e-10, 1e-10, 1e- -0.02423, -0.15532, 6389, combined TCP:Cholesterol, Total(mg/dL) 10, 1e-10, 1e-10 -0.02879, -6.00312 6389 (ln + 1)LCP:Serum LDL cholesterol (mg/dL), LCP:Serum LDL cholesterol (mg/dL), (ln + 1)LCPSI:Serum LDL 0.003305795, -0.05566(0.01888), cholesterol: SI (mmol/L), LCPSI:Serum 0.001581813, -6.75448(2.12984), LDL cholesterol: SI (mmol/L), (ln + 0.002462103, -0.04207(0.01384), 1)LCP:SerumLDLcholesterol(mg/dL), 0.00157701, -0.17471(0.05508), LCP:SerumLDLcholesterol(mg/dL), (ln + 0.003305795, -0.05566(0.01888), 720, 720, 1)LCPSI:SerumLDLcholesterol:SI(mmol/L 0.001581813, -6.75448(2.12984), 720, 720, ), 0.002462103, -0.04207(0.01384), 720, 720, NHANES III LCPSI:SerumLDLcholesterol:SI(mmol/L) 0.00157701 -0.17471(0.05508) 720, 720 Coronary heart disease, 21378988, (ln + 1)LBDLDL:LDL-cholesterol Response to statin 20339536, (mg/dL), LBDLDL:LDL-cholesterol -0.07835(0.01819), therapy, Cholesterol, 19060910, (mg/dL), (ln + 1)LBDLDLSI:LDL- 1.85e-05, 2.3e- -8.96692(2.10598), CELSR2 RS646776 total, Progranulin levels, 21087763, LDL Cholesterol MA cholesterol (mmol/L), LBDLDLSI:LDL- 05, 1.72e-05, -0.05789(0.01339), 838, 838, Myocardial infarction 19198609, NHANES 9902 cholesterol (mmol/L) 2.31e-05 -0.23188(0.05447) 838, 838 (early onset), LDL 19060911, cholesterol 18193044 (ln + 1)LCP:Serum LDL cholesterol (mg/dL), (ln + 1)LCP:SerumLDLcholesterol(mg/dL), LCP:Serum LDL cholesterol (mg/dL), (ln + 1558, 1)LCPSI:Serum LDL cholesterol: SI 1558, (mmol/L), 
LCPSI:Serum LDL cholesterol: 1558, SI (mmol/L), 1.13e-07, 1.13e- 1558, LCP:SerumLDLcholesterol(mg/dL), (ln + 07, 5.6e-08, 7.3e- -0.06987, -0.06987, 1558, 1)LCPSI:SerumLDLcholesterol:SI(mmol/L 08, 5.6e-08, 5.6e- -8.17758, -0.0521, 1558, NHANES ), 08, 7.3e-08, 5.6e- -0.21149, -8.17758, 1558, combined LCPSI:SerumLDLcholesterol:SI(mmol/L) 08 -0.0521, -0.21149 1558 (ln + 1)TGP:Triglyceride (mg/dL), 0.001379921, 0.06271(0.01958), 2519, TGP:Triglyceride (mg/dL), (ln + 0.006528063, 11.15189(4.09653), 2519, 1)TGPSI:Triglyceride (mmol/L), 0.001449048, 0.03866(0.01213), 2519, TGPSI:Triglyceride (mmol/L), (ln + 0.006567861, 0.12582(0.04625), 2519, 1)TGP:Serumtriglycerides(mg/dL), 0.001379921, 0.06271(0.01958), 2519, TGP:Serumtriglycerides(mg/dL), (ln + 0.006528063, 11.15189(4.09653), 2519, 1)TGPSI:Serumtriglycerides:SI(mmol/L), 0.001449048, 0.03866(0.01213), 2519, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.006567861 0.12582(0.04625) 2519 (ln + 1)LBDSTRSI:Triglycerides 0.04234(0.00964), 3935, (mmol/L), LBDSTRSI:Triglycerides 1.14e-05, 0.12764(0.0369), 3935, (mmol/L), (ln + 1)LBDTRSI:Triglyceride 0.000548183, 0.05766(0.01351), 1942, Metabolic traits, 19060910, (mmol/L), LBDTRSI:Triglyceride 2.05e-05, 0.18082(0.05908), 1942, APOB rs673548 Triglycerides NHW Metabolic syndrome 22399527 (mmol/L), (ln + 1)LBXSTR:Triglycerides 0.0022372, 5.5e- 0.07094(0.01559), 3935, (mg/dL), LBXSTR:Triglycerides (mg/dL), 06, 0.000548696, 11.3045(3.26849), 3935, (ln + 1)LBXTR:Triglyceride (mg/dL), 1.05e-05, 0.09507(0.02152), 1942, NHANES 9902 LBXTR:Triglyceride (mg/dL) 0.00223021 16.0195(5.23202) 1942 (ln + 1)TGP:Triglyceride (mg/dL), 4461, TGP:Triglyceride (mg/dL), (ln + 4461, 1)TGPSI:Triglyceride (mmol/L), 1.23e-07, 4.41e- 4461, TGPSI:Triglyceride (mmol/L), (ln + 05, 2.08e-07, 4461, 1)TGP:Serumtriglycerides(mg/dL), 4.45e-05, 1.23e- 0.07688, 13.2836, 4461, TGP:Serumtriglycerides(mg/dL), (ln + 07, 4.41e-05, 0.04698, 0.1499, 4461, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 2.08e-07, 4.45e- 
0.07688, 13.2836, 4461, combined TGPSI:Serumtriglycerides:SI(mmol/L) 05 0.04698, 0.1499 4461

163 -0.03718(0.00754), 2067, (ln + 1)UAP:Uric acid (mg/dL), UAP:Uric -0.22373(0.04695), 2067, acid (mg/dL), (ln + 1)UAPSI:Uric acid 8.73e-07, 2.02e- -0.04539(0.0092), - 2067, (umol/L), UAPSI:Uric acid (umol/L), (ln + 06, 8.79e-07, 13.30775(2.79264), 2067, 1)UAP:Serumuricacid(mg/dL), 2.02e-06, 8.73e- -0.03718(0.00754), 2067, UAP:Serumuricacid(mg/dL), (ln + 07, 2.02e-06, -0.22373(0.04695), 2067, 1)UAPSI:Serumuricacid:SI(umol/L), 8.79e-07, 2.02e- -0.04539(0.0092), - 2067, NHANES III UAPSI:Serumuricacid:SI(umol/L) 06 13.30775(2.79264) 2067

(ln + 1)LBDSUASI:Uric acid (umol/L), 6.91e-05, -0.07507, 1324, SLC2A9 rs6855911 Urate levels 17997608 Uric acid NHB LBDSUASI:Uric acid (umol/L), (ln + 0.00010151, -20.4983, 1324, 1)LBXSUA:Uric acid (mg/dL), 5.34e-05, -0.06122, 1324, NHANES 9902 LBXSUA:Uric acid (mg/dL) 0.000101439 -0.34463 1324 3391, (ln + 1)UAP:Uric acid (mg/dL), UAP:Uric 3391, acid (mg/dL), (ln + 1)UAPSI:Uric acid 3391, (umol/L), UAPSI:Uric acid (umol/L), (ln + 3391, 1)UAP:Serumuricacid(mg/dL), 8e-10, 3.5e-09, -0.03685, -0.22383, 3391, UAP:Serumuricacid(mg/dL), (ln + 9e-10, 3.5e-09, -0.04467, -13.3132, 3391, NHANES 1)UAPSI:Serumuricacid:SI(umol/L), 8e-10, 3.5e-09, -0.03685, -0.22383, 3391, combined UAPSI:Serumuricacid:SI(umol/L) 9e-10, 3.5e-09 -0.04467, -13.3132 3391

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 5.10E-09 | -0.04367 | 2013
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 5.10E-09 | -0.04367 | 2013
NHANES III | (ln+1)UAP:Serumuricacid(mg/dL) | 5.10E-09 | -0.04367 | 2013
NHANES III | UAP:Uric acid (mg/dL) | 6.10E-09 | -0.26822 | 2013
NHANES III | UAP:Uric acid (mg/dL) | 6.10E-09 | -0.26822 | 2013
NHANES III | UAP:Serumuricacid(mg/dL) | 6.10E-09 | -0.26822 | 2013
NHANES III | UAPSI:Uric acid (umol/L) | 6.10E-09 | -15.95256 | 2013
NHANES III | UAPSI:Uric acid (umol/L) | 6.10E-09 | -15.95256 | 2013
NHANES III | UAPSI:Serumuricacid:SI(umol/L) | 6.10E-09 | -15.95256 | 2013
NHANES III | (ln+1)UAPSI:Uric acid (umol/L) | 1.99E-08 | -0.05155 | 2013
NHANES III | (ln+1)UAPSI:Uric acid (umol/L) | 1.99E-08 | -0.05155 | 2013
NHANES III | (ln+1)UAPSI:Serumuricacid:SI(umol/L) | 1.99E-08 | -0.05155 | 2013

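Closest Gene(s): SLC2A9 | SNP: rs6855911 | Previously Published Phenotype (NHGRI GWAS Catalog): Urate levels | Pubmed ID: 17997608 | Phenotype Class: Uric acid | Race/Ethnicity: MA
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | (ln+1)Uric acid (umol/L) | 1.83E-15 | -0.07507 | 1863
NHANES 9902 | LBDSUASI:Uric acid (umol/L) | 5.59E-13 | -20.4983 | 1863
NHANES 9902 | (ln+1)LBXSUA:Uric acid (mg/dL) | 4.20E-15 | -0.06122 | 1863
NHANES 9902 | LBXSUA:Uric acid (mg/dL) | 5.60E-13 | -0.34463 | 1863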

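Study | Phenotype Description | P-value | Beta | Sample Size
NHANES combined | (ln+1)UAP:Uric acid (mg/dL) | 2.87E-22 | -0.05231 | 3876
NHANES combined | (ln+1)UAP:Uric acid (mg/dL) | 2.87E-22 | -0.05231 | 3876
NHANES combined | (ln+1)UAP:Serumuricacid(mg/dL) | 2.87E-22 | -0.05231 | 3876
NHANES combined | UAP:Uric acid (mg/dL) | 2.45E-20 | -0.30623 | 3876
NHANES combined | UAP:Uric acid (mg/dL) | 2.45E-20 | -0.30623 | 3876
NHANES combined | UAP:Serumuricacid(mg/dL) | 2.45E-20 | -0.30623 | 3876
NHANES combined | (ln+1)UAPSI:Uric acid (umol/L) | 7.78E-22 | -0.06309 | 3876
NHANES combined | (ln+1)UAPSI:Uric acid (umol/L) | 7.78E-22 | -0.06309 | 3876
NHANES combined | (ln+1)UAPSI:Serumuricacid:SI(umol/L) | 7.78E-22 | -0.06309 | 3876
NHANES combined | UAPSI:Uric acid (umol/L) | 2.43E-20 | -18.2142 | 3876
NHANES combined | UAPSI:Uric acid (umol/L) | 2.43E-20 | -18.2142 | 3876
NHANES combined | UAPSI:Serumuricacid:SI(umol/L) | 2.43E-20 | -18.2142 | 3876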

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 1.73E-14 | -0.05399 | 2571
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 1.73E-14 | -0.05399 | 2571
NHANES III | (ln+1)UAP:Serumuricacid(mg/dL) | 1.73E-14 | -0.05399 | 2571
NHANES III | UAP:Uric acid (mg/dL) | 1.79E-11 | -0.30171 | 2571
NHANES III | UAP:Uric acid (mg/dL) | 1.79E-11 | -0.30171 | 2571
NHANES III | UAP:Serumuricacid(mg/dL) | 1.79E-11 | -0.30171 | 2571
NHANES III | (ln+1)UAPSI:Uric acid (umol/L) | 5.90E-15 | -0.06604 | 2571
NHANES III | (ln+1)UAPSI:Uric acid (umol/L) | 5.90E-15 | -0.06604 | 2571
NHANES III | (ln+1)UAPSI:Serumuricacid:SI(umol/L) | 5.90E-15 | -0.06604 | 2571
NHANES III | UAPSI:Uric acid (umol/L) | 1.78E-11 | -17.94728 | 2571
NHANES III | UAPSI:Uric acid (umol/L) | 1.78E-11 | -17.94728 | 2571
NHANES III | UAPSI:Serumuricacid:SI(umol/L) | 1.78E-11 | -17.94728 | 2571

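Closest Gene(s): SLC2A9 | SNP: rs6855911 | Previously Published Phenotype (NHGRI GWAS Catalog): Urate levels | Pubmed ID: 17997608 | Phenotype Class: Uric acid | Race/Ethnicity: NHW
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | (ln+1)LBDSUASI:Uric acid (umol/L) | 5.75E-21 | -0.06858 | 3647
NHANES 9902 | LBDSUASI:Uric acid (umol/L) | 4.01E-17 | -19.3409 | 3647
NHANES 9902 | (ln+1)LBXSUA:Uric acid (mg/dL) | 2.61E-20 | -0.05605 | 3720
NHANES 9902 | LBXSUA:Uric acid (mg/dL) | 4.01E-17 | -0.32515 | 3720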

(ln+1)UAP:Uric acid (mg/dL) 2.52E-33 -0.05533, 5836 (ln+1)UAP:Uric acid (mg/dL) 2.52E-33 -0.05533, 5836 (ln+1)UAP:Serumuricacid(mg/dL) 2.52E-33 -0.05533, 5836 UAP:Uric acid (mg/dL) 3.62E-27 -0.31663, 5836 UAP:Uric acid (mg/dL) 3.62E-27 -0.31663, 5836 UAP:Serumuricacid(mg/dL) 3.62E-27 -0.31663, 5836 (ln+1)UAPSI:Uric acid (umol/L) 1.93E-34 -0.06768, 5837 (ln+1)UAPSI:Uric acid (umol/L) 1.93E-34 -0.06768, 5837 (ln+1)UAPSI:Serumuricacid:SI(umol/L) 1.93E-34 -0.06768, 5837 UAPSI:Uric acid (umol/L) 3.59E-27 -18.8343, 5837 NHANES UAPSI:Uric acid (umol/L) 3.59E-27 -18.8343, 5837 combined UAPSI:Serumuricacid:SI(umol/L) 3.59E-27 -18.8343 5837 (ln + 1)TGP:Triglyceride (mg/dL), 0.001946297, -0.05957(0.0192), - 2438, TGP:Triglyceride (mg/dL), (ln + 0.005663677, 11.01723(3.97863), 2438, 1)TGPSI:Triglyceride (mmol/L), 0.002059369, -0.03664(0.01188), 2438, TGPSI:Triglyceride (mmol/L), (ln + 0.005675023, -0.12436(0.04492), 2438, 1)TGP:Serumtriglycerides(mg/dL), 0.001946297, -0.05957(0.0192), - 2438, TGP:Serumtriglycerides(mg/dL), (ln + 0.005663677, 11.01723(3.97863), 2438, 1)TGPSI:Serumtriglycerides:SI(mmol/L), 0.002059369, -0.03664(0.01188), 2438, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.005675023 -0.12436(0.04492) 2438 (ln + 1)LBDSTRSI:Triglycerides 0.000199788, -0.03399(0.00913), 3964, (mmol/L), LBDSTRSI:Triglycerides 0.00155008, -0.11051(0.03489), 3964, (mmol/L), (ln + 1)LBDTRSI:Triglyceride 0.000459783, -0.04534(0.01292), 1957, C2orf43 - (mmol/L), LBDTRSI:Triglyceride 0.00810739, -0.14933(0.05634), 1957, rs7557067 Triglycerides 19060906 Triglycerides NHW APOB (mmol/L), (ln + 1)LBXSTR:Triglycerides 0.000135857, -0.05648(0.01479), 3964, (mg/dL), LBXSTR:Triglycerides (mg/dL), 0.00155158, -9.78717(3.0903), - 3964, (ln + 1)LBXTR:Triglyceride (mg/dL), 0.000268456, 0.07525(0.02061), 1957, NHANES 9902 LBXTR:Triglyceride (mg/dL) 0.00807072 -13.2331(4.9902) 1957 (ln + 1)TGP:Triglyceride (mg/dL), 2.08e-06, 4395, TGP:Triglyceride (mg/dL), (ln + 0.000123491, 4395, 
1)TGPSI:Triglyceride (mmol/L), 3.43e-06, 4395, TGPSI:Triglyceride (mmol/L), (ln + 0.000124282, 4395, 1)TGP:Serumtriglycerides(mg/dL), 2.08e-06, -0.06687, -12.0477, 4395, TGP:Serumtriglycerides(mg/dL), (ln + 0.000123491, -0.0407, -0.13597, 4395, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 3.43e-06, -0.06687, -12.0477, 4395, combined TGPSI:Serumtriglycerides:SI(mmol/L) 0.000124282 -0.0407, -0.13597 4395

165 (ln + 1)G1P:Glucose, plasma (mg/dL), (ln 0.004292765, 0.02794(0.00977), 1638, + 1)G1PSI:Plasma glucose: SI(mmol/L), 0.004839575, 0.02463(0.00873), 1638, (ln + 1)G1P:Glucose, plasma(mg/dL), (ln + 0.004292765, 0.02794(0.00977), 1638, NHANES III 1)G1PSI:Plasmaglucose:SI(mmol/L) 0.004839575 0.02463(0.00873) 1638 18193043, Metabolic syndrome, (ln + 1)LBDSGLSI:Glucose (mmol/L), (ln 0.00140782, 0.02453(0.00767), 1872, 22399527, NHANES 9902 + 1)LBXSGL:Serum Glucose (mg/dL) 0.000999451 0.02876(0.00873) 1872 Phospholipid levels 19060911, (plasma), Triglycerides, (ln + 1)G1P:Glucose, plasma (mg/dL), 0.000416027, 2573, 21829377, Fasting glucose-related G1P:Glucose, plasma (mg/dL), (ln + 0.00485275, 2573, 18193044, GCKR RS780094 traits, LDL cholesterol, Glucose MA 1)G1PSI:Plasma glucose: SI(mmol/L), 0.000552933, 2573, 18179892, Fasting insulin-related G1PSI:Plasma glucose: SI(mmol/L), (ln + 0.00480957, 2573, 20081858, traits, Uric acid levels, 1)SGP:Glucose (mg/dL), SGP:Glucose 8.81e-05, 3885, 19503597, Metabolic traits, C- (mg/dL), (ln + 1)SGPSI:Serum glucose: SI 0.00209201, 3885, 21886157, reactive protein (mmol/L), SGPSI:Serum glucose: SI 0.000123227, 3885, 18439548 (mmol/L), (ln + 1)G1P:Glucose, 0.00209745, 3885, plasma(mg/dL), G1P:Glucose, 0.000416027, 0.02706, 3.56765, 2573, plasma(mg/dL), (ln + 0.00485275, 0.02359, 0.1983, 2573, 1)G1PSI:Plasmaglucose:SI(mmol/L), 0.000552933, 0.02343, 2.86114, 2573, G1PSI:Plasmaglucose:SI(mmol/L), (ln + 0.00480957, 0.02029, 0.15879, 2573, 1)SGP:Glucose(mg/dL), 8.81e-05, 0.02706, 3.56765, 3885, SGP:Glucose(mg/dL), (ln + 0.00209201, 0.02359, 0.1983, 3885, NHANES 1)SGPSI:Serumglucose:SI(mmol/L), 0.000123227, 0.02343, 2.86114, 3885, combined SGPSI:Serumglucose:SI(mmol/L) 0.00209745 0.02029, 0.15879 3885 (ln + 1)TCP:Total cholesterol (mg/dL), 0.002977697, 0.02272(0.00764), 2040, TCP:Total cholesterol (mg/dL), (ln + 0.00341676, 4.49692(1.53428), 2040, 1)TCPSI:Total cholesterol (mmol/L), 0.002103869, 0.01954(0.00635), 2034, TCPSI:Total 
cholesterol (mmol/L), (ln + 0.002499759, 0.11993(0.03962), 2034, 1)CHP:Serumcholesterol(mg/dL), 0.00217528, 0.02335(0.00761), 2034, CHP:Serumcholesterol(mg/dL), (ln + 0.002502777, 4.63697(1.53198), 2034, 1)TCP:Cholesterol, Total(mg/dL), 0.002977697, 0.02272(0.00764), 2040, NHANES III TCP:Cholesterol, Total(mg/dL) 0.00341676 4.49692(1.53428) 2040 (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 0.00319823, 0.01786(0.00605), 1857, Cholesterol, total, LDL LBDSCHSI:Cholesterol (mmol/L), (ln + 0.00170523, 0.11839(0.03768), 1857, cholesterol, Phospholipid 22359512, 1)LBDTCSI:Total cholesterol (mmol/L), 0.00150198, 0.01909(0.006), 1857, levels (plasma), 20657596, LBDTCSI:Total cholesterol (mmol/L), (ln 0.000842135, 0.12611(0.03771), 1857, Hypertriglyceridemia, 21729881, + 1)LBXSCH:Serum Cholesterol (mg/dL), 0.00345819, 0.02122(0.00725), 1857, Vitamin E levels, 20864672, LBXSCH:Serum Cholesterol (mg/dL), (ln 0.00169641, 4.58014(1.45707), 1857, Metabolic syndrome, ZNF259 rs964184 22399527, Cholesterol MA + 1)LBXTC:Total cholesterol (mg/dL), 0.00160427, 0.02269(0.00718), 1857, Triglycerides, 20686565, LBXTC:Total cholesterol (mg/dL), (ln + 0.00083616, 4.87961(1.45833), 1857, Lipoprotein-associated 22003152, 1)LBXSCH:Cholesterol, total (mg/dL), 0.00345819, 0.02122(0.00725), 1857, phospholipase A2 19060906, NHANES 9902 LBXSCH:Cholesterol, total (mg/dL) 0.00169641 4.58014(1.45707) 1857 activity and mass, HDL 21378990 cholesterol, Coronary (ln + 1)CHPSI:Serum cholesterol: SI heart disease (mmol/L), CHPSI:Serum cholesterol: SI 2034, (mmol/L), (ln + 2034, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.00210387, 2034, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00249976, 2034, 1)TCP:Total cholesterol (mg/dL), 0.00210387, 3897, TCP:Total cholesterol (mg/dL), (ln + 0.00249976, 0.01954, 0.11993, 3897, 1)TCPSI:Total cholesterol (mmol/L), 8.65e-06, 5.8e- 0.01954, 0.11993, 3897, TCPSI:Total cholesterol (mmol/L), (ln + 06, 7.93e-06, 0.02356, 4.8388, 3897, NHANES 1)TCP:Cholesterol, Total(mg/dL), 5.88e-06, 
8.65e- 0.01972, 0.12505, 3897, combined TCP:Cholesterol, Total(mg/dL) 06, 5.8e-06 0.02356, 4.8388 3897 Cholesterol, total, LDL 22359512, cholesterol, Phospholipid 20657596, ZNF259 rs964184 Cholesterol NHW levels (plasma), 21729881, TCPSI:Total cholesterol (mmol/L), 0.008286225, 0.12633(0.04781), 2592, Hypertriglyceridemia, 20864672, NHANES III CHP:Serumcholesterol(mg/dL) 0.00828002 4.88553(1.84885) 2592

166 Vitamin E levels, 22399527, (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 0.00820609, 0.01445(0.00546), 3950, Metabolic syndrome, 20686565, LBDSCHSI:Cholesterol (mmol/L), (ln + 0.00365223, 0.10025(0.03447), 3950, Triglycerides, 22003152, 1)LBDTCSI:Total cholesterol (mmol/L), 0.00424458, 0.01565(0.00547), 3952, Lipoprotein-associated 19060906, LBDTCSI:Total cholesterol (mmol/L), (ln 0.00180391, 0.10938(0.03503), 3952, phospholipase A2 21378990 + 1)LBXSCH:Serum Cholesterol (mg/dL), 0.00980405, 0.01686(0.00652), 3950, activity and mass, HDL LBXSCH:Serum Cholesterol (mg/dL), (ln 0.00366616, 3.87461(1.33273), 3950, cholesterol, Coronary + 1)LBXTC:Total cholesterol (mg/dL), 0.00516521, 0.01822(0.00651), 3952, heart disease LBXTC:Total cholesterol (mg/dL), (ln + 0.00181404, 4.22743(1.35441), 3952, 1)LBXSCH:Cholesterol, total (mg/dL), 0.00980405, 0.01686(0.00652), 3950, NHANES 9902 LBXSCH:Cholesterol, total (mg/dL) 0.00366616 3.87461(1.33273) 3950 CHPSI:Serum cholesterol: SI (mmol/L), 0.00828622, 2592, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00828622, 2592, 1)TCP:Total cholesterol (mg/dL), 0.000663022, 6546, TCP:Total cholesterol (mg/dL), (ln + 0.000210561, 6546, 1)TCPSI:Total cholesterol (mmol/L), 0.000534475, 0.12633, 0.12633, 6546, TCPSI:Total cholesterol (mmol/L), (ln + 0.00020968, 0.01791, 4.02551, 6546, NHANES 1)TCP:Cholesterol, Total(mg/dL), 0.000663022, 0.01528, 0.10413, 6546, combined TCP:Cholesterol, Total(mg/dL) 0.000210561 0.01791, 4.02551 6546 68.53293(16.58055 Cholesterol, total, LDL ), 1.59158(0.385), 2039, cholesterol, Phospholipid VEP:Vitamin E (ug/dL), VEPSI:Vitamin E 3.72e-05, 3.71e- 68.53293(16.58055 2039, 22359512, levels (plasma), NHANES III (umol/L), VEP:SerumvitaminE(ug/dL) 05, 3.72e-05 ) 2039 20657596, Hypertriglyceridemia, (ln + 1)LBDVIESI:Vitamin E (umol/L), 0.000769562, 0.05818(0.01724), 21729881, Vitamin E levels, LBDVIESI:Vitamin E (umol/L), (ln + 0.000796038, 2.1097(0.62696), 20864672, Metabolic syndrome, 1)LBXVIE:Vitamin E (ug/dL), 
0.000780217, 0.06012(0.01784), 955, 955, ZNF259 rs964184 22399527, Vitamin E MA Triglycerides, NHANES 9902 LBXVIE:Vitamin E (ug/dL) 0.000795988 90.8571(27.0006) 955, 955 20686565, Lipoprotein-associated 22003152, (ln + 1)VEP:Vitamin E (ug/dL), 2994, phospholipase A2 19060906, VEP:Vitamin E (ug/dL), (ln + 2994, activity and mass, HDL 21378990 1)VEPSI:Vitamin E (umol/L), 1.04e-08, 5.79e- 2994, cholesterol, Coronary VEPSI:Vitamin E (umol/L), (ln + 08, 1.01e-08, 0.05814, 78.5202, 2994, heart disease NHANES 1)VEP:SerumvitaminE(ug/dL), 5.77e-08, 1.04e- 0.05603, 1.8234, 2994, combined VEP:SerumvitaminE(ug/dL) 08, 5.79e-08 0.05814, 78.5202 2994 117.18899(23.5141 Cholesterol, total, LDL 7), 2.72133(0.546), 2594, cholesterol, Phospholipid VEP:Vitamin E (ug/dL), VEPSI:Vitamin E 6.65e-07, 6.63e- 117.18899(23.5141 2594, 22359512, levels (plasma), NHANES III (umol/L), VEP:SerumvitaminE(ug/dL) 07, 6.65e-07 7) 2594 20657596, Hypertriglyceridemia, (ln + 1)LBDVIESI:Vitamin E (umol/L), 0.000200623, 0.07536(0.02022), 1622, 21729881, Vitamin E levels, LBDVIESI:Vitamin E (umol/L), (ln + 0.00100556, 2.61418(0.79338), 1622, 20864672, Metabolic syndrome, 1)LBXVIE:Vitamin E (ug/dL), 0.000193238, 0.07794(0.02086), 1622, ZNF259 rs964184 22399527, Vitamin E NHW Triglycerides, NHANES 9902 LBXVIE:Vitamin E (ug/dL) 0.00100418 112.597(34.1684) 1622 20686565, Lipoprotein-associated 22003152, (ln + 1)VEP:Vitamin E (ug/dL), 4216, phospholipase A2 19060906, VEP:Vitamin E (ug/dL), (ln + 4216, activity and mass, HDL 21378990 1)VEPSI:Vitamin E (umol/L), 4216, cholesterol, Coronary VEPSI:Vitamin E (umol/L), (ln + 1e-10, 3.4e-09, 0.08205, 117.182, 4216, heart disease NHANES 1)VEP:SerumvitaminE(ug/dL), 1e-10, 3.4e-09, 0.07935, 2.72096, 4216, combined VEP:SerumvitaminE(ug/dL) 1e-10, 3.4e-09 0.08205, 117.182 4216

Closest Gene(s): ZNF259 | SNP: rs964184 | Previously Published Phenotype (NHGRI GWAS Catalog): Cholesterol, total, LDL cholesterol, Phospholipid levels (plasma), Hypertriglyceridemia, Vitamin E levels, Metabolic syndrome, Triglycerides, Lipoprotein-associated phospholipase A2 activity and mass, HDL cholesterol, Coronary heart disease | Pubmed ID: 22359512, 20657596, 21729881, 20864672, 22399527, 20686565, 22003152, 19060906, 21378990 | Phenotype Class: Triglycerides | Race/Ethnicity: MA
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | (ln+1)TGP:Triglyceride (mg/dL) | 5.11E-13 | 0.14365 | 2040
NHANES III | (ln+1)TGP:Triglyceride (mg/dL) | 5.11E-13 | 0.14365 | 2040
NHANES III | (ln+1)TGP:Serumtriglycerides(mg/dL) | 5.11E-13 | 0.14365 | 2040
NHANES III | TGP:Triglyceride (mg/dL) | 5.56E-12 | 28.31933 | 2040
NHANES III | TGP:Triglyceride (mg/dL) | 5.56E-12 | 28.31933 | 2040
NHANES III | TGP:Serumtriglycerides(mg/dL) | 5.56E-12 | 28.31933 | 2040
NHANES III | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.72E-13 | 0.09182 | 2040
NHANES III | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.72E-13 | 0.09182 | 2040
NHANES III | (ln+1)TGPSI:Serumtriglycerides:SI(mmol/L) | 1.72E-13 | 0.09182 | 2040
NHANES III | TGPSI:Triglyceride (mmol/L) | 5.51E-12 | 0.31978 | 2040
NHANES III | TGPSI:Triglyceride (mmol/L) | 5.51E-12 | 0.31978 | 2040
NHANES III | TGPSI:Serumtriglycerides:SI(mmol/L) | 5.51E-12 | 0.31978 | 2040

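Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | (ln+1)LBDSTRSI:Triglycerides (mmol/L) | 3.91E-20 | 0.11911 | 1857
NHANES 9902 | LBDSTRSI:Triglycerides (mmol/L) | 4.22E-18 | 0.48298 | 1857
NHANES 9902 | (ln+1)LBXSTR:Triglycerides (mg/dL) | 3.66E-19 | 0.18341 | 1857
NHANES 9902 | LBXSTR:Triglycerides (mg/dL) | 4.21E-18 | 42.7805 | 1857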

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES combined | (ln+1)TGP:Triglyceride (mg/dL) | 9.55E-22 | 0.15599 | 2977
NHANES combined | (ln+1)TGP:Triglyceride (mg/dL) | 9.55E-22 | 0.15599 | 2977
NHANES combined | (ln+1)TGP:Serumtriglycerides(mg/dL) | 9.55E-22 | 0.15599 | 2977
NHANES combined | TGP:Triglyceride (mg/dL) | 1.18E-21 | 33.5258 | 2977
NHANES combined | TGP:Triglyceride (mg/dL) | 1.18E-21 | 33.5258 | 2977
NHANES combined | TGP:Serumtriglycerides(mg/dL) | 1.18E-21 | 33.5258 | 2977
NHANES combined | (ln+1)TGPSI:Triglyceride (mmol/L) | 7.12E-23 | 0.10112 | 2977
NHANES combined | (ln+1)TGPSI:Triglyceride (mmol/L) | 7.12E-23 | 0.10112 | 2977
NHANES combined | (ln+1)TGPSI:Serumtriglycerides:SI(mmol/L) | 7.12E-23 | 0.10112 | 2977
NHANES combined | TGPSI:Triglyceride (mmol/L) | 1.14E-21 | 0.37861 | 2977
NHANES combined | TGPSI:Triglyceride (mmol/L) | 1.14E-21 | 0.37861 | 2977
NHANES combined | TGPSI:Serumtriglycerides:SI(mmol/L) | 1.14E-21 | 0.37861 | 2977

(ln+1)TGPSI:Triglyceride (mmol/L) (ln+1)TGPSI:Triglyceride (mmol/L) 6.00E-09 0.08021, 2594 (ln+1)TGPSI:Serumtriglycerides:SI(mmol/ 6.00E-09 0.08021, 2594 L) 6.00E-09 0.08021, 2594 (ln+1)TGP:Triglyceride (mg/dL) 1.37E-08 0.1265, 2594 (ln+1)TGP:Triglyceride (mg/dL) 1.37E-08 0.1265, 2594 (ln+1)TGP:Serumtriglycerides(mg/dL) 1.37E-08 0.1265, 2594 TGPSI:Triglyceride (mmol/L) 3.92E-08 0.28803, 2594 TGPSI:Triglyceride (mmol/L) 3.92E-08 0.28803, 2594 TGPSI:Serumtriglycerides:SI(mmol/L) 3.92E-08 0.28803, 2594 Cholesterol, total, LDL TGP:Triglyceride (mg/dL) 3.94E-08 25.50696, 2594 cholesterol, Phospholipid TGP:Triglyceride (mg/dL) 3.94E-08 25.50696, 2594 22359512, levels (plasma), NHANES III TGP:Serumtriglycerides(mg/dL) 3.94E-08 25.50696 2594 20657596, Hypertriglyceridemia, (ln+1)LBDSTRSI:Triglycerides (mmol/L) 3.81E-13 3950 21729881, Vitamin E levels, (ln+1)LBXSTR:Triglycerides (mg/dL) 5.56E-13 1.00E- 0.08141, 3950 20864672, ZNF259 Metabolic syndrome, (ln+1)LBDTRSI:Triglyceride (mmol/L) 10 0.13095 ,0.10358, 1954 rs964184 22399527, Triglycerides NHW Triglycerides, (ln+1)LBXTR:Triglyceride (mg/dL) 1.00E-10 0.16361, 1954 20686565, Lipoprotein-associated LBDSTRSI:Triglycerides (mmol/L) 5.00E-10 0.26776, 3950 22003152, phospholipase A2 NHANES 9902 LBXSTR:Triglycerides (mg/dL) 5.00E-10 23.7171 3950 19060906, activity and mass, HDL 21378990 cholesterol, Coronary (ln+1)TGP:Triglyceride (mg/dL) heart disease (ln+1)TGP:Triglyceride (mg/dL) 2.88E-17 0.14139, 4548 (ln+1)TGP:Serumtriglycerides(mg/dL) 2.88E-17 0.14139, 4548 TGP:Triglyceride (mg/dL) 2.88E-17 0.14139, 4548 TGP:Triglyceride (mg/dL) 1.64E-13 27.5495, 4548 TGP:Serumtriglycerides(mg/dL) 1.64E-13 27.5495, 4548 (ln+1)TGPSI:Triglyceride (mmol/L) 1.64E-13 27.5495, 4548 (ln+1)TGPSI:Triglyceride (mmol/L) 7.25E-18 0.08964, 4548 (ln+1)TGPSI:Serumtriglycerides:SI(mmol/ 7.25E-18 0.08964, 4548 L) 7.25E-18 0.08964, 4548 TGPSI:Triglyceride (mmol/L) 1.63E-13 0.31105, 4548 NHANES TGPSI:Triglyceride (mmol/L) 1.63E-13 0.31105, 4548 combined 
TGPSI:Serumtriglycerides:SI(mmol/L) 1.63E-13 0.31105 4548

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 2.00E-10 | -0.05109 | 1820
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 2.00E-10 | -0.05109 | 1820
NHANES III | (ln+1)UAP:Serumuricacid(mg/dL) | 2.00E-10 | -0.05109 | 1820
NHANES III | (ln+1)UAPSI:Uric acid (umol/L) | 2.00E-10 | -0.06162 | 1820
NHANES III | (ln+1)UAPSI:Uric acid (umol/L) | 2.00E-10 | -0.06162 | 1820
NHANES III | (ln+1)UAPSI:Serumuricacid:SI(umol/L) | 2.00E-10 | -0.06162 | 1820
NHANES III | UAP:Uric acid (mg/dL) | 7.00E-10 | -0.30729 | 1820
NHANES III | UAP:Uric acid (mg/dL) | 7.00E-10 | -0.30729 | 1820
NHANES III | UAP:Serumuricacid(mg/dL) | 7.00E-10 | -0.30729 | 1820
NHANES III | UAPSI:Uric acid (umol/L) | 7.00E-10 | -18.2769 | 1820
NHANES III | UAPSI:Uric acid (umol/L) | 7.00E-10 | -18.2769 | 1820
NHANES III | UAPSI:Serumuricacid:SI(umol/L) | 7.00E-10 | -18.2769 | 1820

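Closest Gene(s): SLC2A9 | SNP: rs7442295 | Previously Published Phenotype (NHGRI GWAS Catalog): Uric acid levels, Urate levels | Pubmed ID: 18327256, 18179890 | Phenotype Class: Uric acid | Race/Ethnicity: NHW
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | (ln+1)LBDSUASI:Uric acid (umol/L) | 4.49E-22 | -0.07428 | 3940
NHANES 9902 | LBDSUASI:Uric acid (umol/L) | 1.91E-18 | -21.2097 | 3940
NHANES 9902 | (ln+1)LBXSUA:Uric acid (mg/dL) | 1.98E-21 | -0.06084 | 3940
NHANES 9902 | LBXSUA:Uric acid (mg/dL) | 1.90E-18 | -0.35657 | 3940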

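Study | Phenotype Description | P-value | Beta | Sample Size
NHANES combined | (ln+1)UAP:Uric acid (mg/dL) | 2.85E-35 | -0.06085 | 6381
NHANES combined | (ln+1)UAP:Uric acid (mg/dL) | 2.85E-35 | -0.06085 | 6381
NHANES combined | (ln+1)UAP:Serumuricacid(mg/dL) | 2.85E-35 | -0.06085 | 6381
NHANES combined | UAP:Uric acid (mg/dL) | 2.67E-29 | -0.35128 | 6381
NHANES combined | UAP:Uric acid (mg/dL) | 2.67E-29 | -0.35128 | 6381
NHANES combined | UAP:Serumuricacid(mg/dL) | 2.67E-29 | -0.35128 | 6381
NHANES combined | (ln+1)UAPSI:Uric acid (umol/L) | 2.30E-36 | -0.0743 | 6381
NHANES combined | (ln+1)UAPSI:Uric acid (umol/L) | 2.30E-36 | -0.0743 | 6381
NHANES combined | (ln+1)UAPSI:Serumuricacid:SI(umol/L) | 2.30E-36 | -0.0743 | 6381
NHANES combined | UAPSI:Uric acid (umol/L) | 2.68E-29 | -20.8957 | 6381
NHANES combined | UAPSI:Uric acid (umol/L) | 2.68E-29 | -20.8957 | 6381
NHANES combined | UAPSI:Serumuricacid:SI(umol/L) | 2.68E-29 | -20.8957 | 6381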

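Closest Gene(s): SLC2A9 | SNP: rs7442295 | Previously Published Phenotype (NHGRI GWAS Catalog): Uric acid levels, Urate levels | Pubmed ID: 18327256, 18179890 | Phenotype Class: Uric acid | Race/Ethnicity: MA
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 2.00E-10 | -0.05109 | 1820
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 2.00E-10 | -0.05109 | 1820
NHANES III | (ln+1)UAP:Serumuricacid(mg/dL) | 2.00E-10 | -0.05109 | 1820
NHANES III | UAPSI:Uric acid (umol/L) | 2.00E-10 | -0.06162 | 1820
NHANES III | UAPSI:Uric acid (umol/L) | 2.00E-10 | -0.06162 | 1820
NHANES III | UAPSI:Serumuricacid:SI(umol/L) | 2.00E-10 | -0.06162 | 1820
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 7.00E-10 | -0.30729 | 1820
NHANES III | (ln+1)UAP:Uric acid (mg/dL) | 7.00E-10 | -0.30729 | 1820
NHANES III | (ln+1)UAP:Serumuricacid(mg/dL) | 7.00E-10 | -0.30729 | 1820
NHANES III | UAPSI:Uric acid (umol/L) | 7.00E-10 | -18.2769 | 1820
NHANES III | UAPSI:Uric acid (umol/L) | 7.00E-10 | -18.2769 | 1820
NHANES III | UAPSI:Serumuricacid:SI(umol/L) | 7.00E-10 | -18.2769 | 1820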

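Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | (ln+1)LBDSUASI:Uric acid (umol/L) | 1.19E-15 | -0.07693 | 1853
NHANES 9902 | LBDSUASI:Uric acid (umol/L) | 4.49E-13 | -20.9544 | 1853
NHANES 9902 | (ln+1)LBXSUA:Uric acid (mg/dL) | 2.83E-15 | -0.06271 | 1853
NHANES 9902 | LBXSUA:Uric acid (mg/dL) | 4.49E-13 | -0.3523 | 1853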

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES combined | (ln+1)UAP:Uric acid (mg/dL) | 1.69E-24 | -0.05741 | 3673
NHANES combined | (ln+1)UAP:Uric acid (mg/dL) | 1.69E-24 | -0.05741 | 3673
NHANES combined | (ln+1)UAP:Serumuricacid(mg/dL) | 1.69E-24 | -0.05741 | 3673
NHANES combined | UAP:Uric acid (mg/dL) | 9.98E-22 | -0.33271 | 3673
NHANES combined | UAP:Uric acid (mg/dL) | 9.98E-22 | -0.33271 | 3673
NHANES combined | UAP:Serumuricacid(mg/dL) | 9.98E-22 | -0.33271 | 3673
NHANES combined | (ln+1)UAPSI:Uric acid (umol/L) | 7.46E-25 | -0.0699 | 3673
NHANES combined | (ln+1)UAPSI:Uric acid (umol/L) | 7.46E-25 | -0.0699 | 3673
NHANES combined | (ln+1)UAPSI:Serumuricacid:SI(umol/L) | 7.46E-25 | -0.0699 | 3673
NHANES combined | UAPSI:Uric acid (umol/L) | 1.01E-21 | -19.7893 | 3673
NHANES combined | UAPSI:Uric acid (umol/L) | 1.01E-21 | -19.7893 | 3673
NHANES combined | UAPSI:Serumuricacid:SI(umol/L) | 1.01E-21 | -19.7893 | 3673

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | (ln+1)TGP:Triglyceride (mg/dL) | 8.86E-07 | -0.1307 | 1997
NHANES III | (ln+1)TGP:Triglyceride (mg/dL) | 8.86E-07 | -0.1307 | 1997
NHANES III | (ln+1)TGP:Serumtriglycerides(mg/dL) | 8.86E-07 | -0.1307 | 1997
NHANES III | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.20E-06 | -0.08089 | 1997
NHANES III | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.20E-06 | -0.08089 | 1997
NHANES III | (ln+1)TGPSI:Serumtriglycerides:SI(mmol/L) | 1.20E-06 | -0.08089 | 1997
NHANES III | TGP:Triglyceride (mg/dL) | 5.98E-05 | -22.04334 | 1997
NHANES III | TGP:Triglyceride (mg/dL) | 5.98E-05 | -22.04334 | 1997
NHANES III | TGP:Serumtriglycerides(mg/dL) | 5.98E-05 | -22.04334 | 1997
NHANES III | TGPSI:Triglyceride (mmol/L) | 6.01E-05 | -0.24881 | 1997
NHANES III | TGPSI:Triglyceride (mmol/L) | 6.01E-05 | -0.24881 | 1997
NHANES III | TGPSI:Serumtriglycerides:SI(mmol/L) | 6.01E-05 | -0.24881 | 1997

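Closest Gene(s): APOA5 | SNP: RS3135506 | Previously Published Phenotype (NHGRI GWAS Catalog): Triglycerides | Phenotype Class: Triglycerides | Race/Ethnicity: MA
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | LBDSTRSI:Triglycerides (mmol/L) | 1.00E-10 | -0.5087 | 1841
NHANES 9902 | LBXSTR:Triglycerides (mg/dL) | 1.00E-10 | -45.0598 | 1841
NHANES 9902 | LBDTRSI:Triglyceride (mmol/L) | 1.80E-09 | -0.58105 | 931
NHANES 9902 | LBXTR:Triglyceride (mg/dL) | 1.80E-09 | -51.451 | 931
NHANES 9902 | (ln+1)LBDSTRSI:Triglycerides (mmol/L) | 2.30E-09 | -0.10979 | 1841
NHANES 9902 | (ln+1)LBXSTR:Triglycerides (mg/dL) | 6.80E-09 | -0.16827 | 1841
NHANES 9902 | (ln+1)LBDTRSI:Triglyceride (mmol/L) | 7.84E-08 | -0.13064 | 931
NHANES 9902 | (ln+1)LBXTR:Triglyceride (mg/dL) | 4.26E-07 | -0.19296 | 931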

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES combined | (ln+1)TGP:Triglyceride (mg/dL) | 3.74E-12 | -0.15137 | 2928
NHANES combined | (ln+1)TGP:Triglyceride (mg/dL) | 3.74E-12 | -0.15137 | 2928
NHANES combined | (ln+1)TGP:Serumtriglycerides(mg/dL) | 3.74E-12 | -0.15137 | 2928
NHANES combined | TGP:Triglyceride (mg/dL) | 1.00E-11 | -31.4213 | 2928
NHANES combined | TGP:Triglyceride (mg/dL) | 1.00E-11 | -31.4213 | 2928
NHANES combined | TGP:Serumtriglycerides(mg/dL) | 1.00E-11 | -31.4213 | 2928
NHANES combined | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.43E-12 | -0.09715 | 2928
NHANES combined | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.43E-12 | -0.09715 | 2928
NHANES combined | (ln+1)TGPSI:Serumtriglycerides:SI(mmol/L) | 1.43E-12 | -0.09715 | 2928
NHANES combined | TGPSI:Triglyceride (mmol/L) | 9.99E-12 | -0.35476 | 2928
NHANES combined | TGPSI:Triglyceride (mmol/L) | 9.99E-12 | -0.35476 | 2928
NHANES combined | TGPSI:Serumtriglycerides:SI(mmol/L) | 9.99E-12 | -0.35476 | 2928

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES III | TGP:Triglyceride (mg/dL) | 2.34E-06 | -31.74427 | 2552
NHANES III | TGP:Triglyceride (mg/dL) | 2.34E-06 | -31.74427 | 2552
NHANES III | TGP:Serumtriglycerides(mg/dL) | 2.34E-06 | -31.74427 | 2552
NHANES III | TGPSI:Triglyceride (mmol/L) | 2.34E-06 | -0.35842 | 2552
NHANES III | TGPSI:Triglyceride (mmol/L) | 2.34E-06 | -0.35842 | 2552
NHANES III | TGPSI:Serumtriglycerides:SI(mmol/L) | 2.34E-06 | -0.35842 | 2552
NHANES III | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.17E-05 | -0.08783 | 2552
NHANES III | (ln+1)TGPSI:Triglyceride (mmol/L) | 1.17E-05 | -0.08783 | 2552
NHANES III | (ln+1)TGPSI:Serumtriglycerides:SI(mmol/L) | 1.17E-05 | -0.08783 | 2552
NHANES III | (ln+1)TGP:Triglyceride (mg/dL) | 3.46E-05 | -0.13414 | 2552
NHANES III | (ln+1)TGP:Triglyceride (mg/dL) | 3.46E-05 | -0.13414 | 2552
NHANES III | (ln+1)TGP:Serumtriglycerides(mg/dL) | 3.46E-05 | -0.13414 | 2552

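Closest Gene(s): APOA5 | SNP: RS3135506 | Previously Published Phenotype (NHGRI GWAS Catalog): Triglycerides | Pubmed ID: 20395964 | Phenotype Class: Triglycerides | Race/Ethnicity: NHW
Study | Phenotype Description | P-value | Beta | Sample Size
NHANES 9902 | LBDSTRSI:Triglycerides (mmol/L) | 6.31E-08 | -0.33577 | 3935
NHANES 9902 | LBXSTR:Triglycerides (mg/dL) | 6.31E-08 | -29.7405 | 3935
NHANES 9902 | (ln+1)LBDSTRSI:Triglycerides (mmol/L) | 1.40E-09 | -0.09828 | 3935
NHANES 9902 | (ln+1)LBXSTR:Triglycerides (mg/dL) | 5.30E-09 | -0.15318 | 3935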

Study | Phenotype Description | P-value | Beta | Sample Size
NHANES combined | (ln+1)TGPSI:Triglyceride (mmol/L) | 6.60E-09 | -0.0878 | 4495
NHANES combined | (ln+1)TGPSI:Triglyceride (mmol/L) | 6.60E-09 | -0.0878 | 4495
NHANES combined | (ln+1)TGPSI:Serumtriglycerides:SI(mmol/L) | 6.60E-09 | -0.0878 | 4495
NHANES combined | (ln+1)TGP:Triglyceride (mg/dL) | 2.75E-08 | -0.13525 | 4495
NHANES combined | (ln+1)TGP:Triglyceride (mg/dL) | 2.75E-08 | -0.13525 | 4495
NHANES combined | (ln+1)TGP:Serumtriglycerides(mg/dL) | 2.75E-08 | -0.13525 | 4495
NHANES combined | TGP:Triglyceride (mg/dL) | 4.76E-08 | -29.5921 | 4495
NHANES combined | TGP:Triglyceride (mg/dL) | 4.76E-08 | -29.5921 | 4495
NHANES combined | TGP:Serumtriglycerides(mg/dL) | 4.76E-08 | -29.5921 | 4495
NHANES combined | TGPSI:Triglyceride (mmol/L) | 4.78E-08 | -0.33408 | 4495
NHANES combined | TGPSI:Triglyceride (mmol/L) | 4.78E-08 | -0.33408 | 4495
NHANES combined | TGPSI:Serumtriglycerides:SI(mmol/L) | 4.78E-08 | -0.33408 | 4495

Appendix Table 2.3. PheWAS associations related to known results in the GWAS Catalog.

Closest Gene(s) | SNP | Previously Published Phenotype (NHGRI GWAS Catalog) | Pubmed ID | Phenotype Class | Race/Ethnicity | Study | Phenotype Description | P-value | Beta | Sample Size

5.64e-07, -0.04352(0.00867), 2.82e-07, -8.9698(1.7408), - (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 4.56e-07, 0.0365(0.00721), - cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 2.78e-07, 0.23205(0.04501), (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 5.64e-07, -0.04352(0.00867), 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 2.82e-07, -8.9698(1.7408), - 2044, 2044, 2044, Total(mg/dL), (ln + 1)CHP:Serumcholesterol(mg/dL), 1.11e-06, 0.04217(0.00863), 2044, 2044, 2044, NHANES III CHP:Serumcholesterol(mg/dL) 3.69e-07 -8.86092(1.73712) 2038, 2038 2.66e-06, -0.03522(0.00748), (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 1.95e-06, -0.22232(0.04658), LBDSCHSI:Cholesterol (mmol/L), (ln + 3.18e-06, -0.03469(0.00742), 1)LBDTCSI:Total cholesterol (mmol/L), 2.66e-06, -0.21964(0.04663), LBDTCSI:Total cholesterol (mmol/L), (ln + 3.02e-06, -0.04196(0.00896), 1)LBXSCH:Serum Cholesterol (mg/dL), 1.92e-06, -8.60376(1.801), - LBXSCH:Serum Cholesterol (mg/dL), (ln + 3.52e-06, 0.0413(0.00888), - 1860, 1860, 1860, LDL cholesterol, 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 2.62e-06, 8.49997(1.8033), - 1860, 1860, 1860, 19060906, Total CELSR2 rs12740374 Coronary heart MA cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 3.02e-06, 0.04196(0.00896), 1860, 1860, 1860, 21347282 Cholesterol disease NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 1.92e-06 -8.60376(1.801) 1860

(ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 8.84e-07, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 3.69e-07, 1)CHPSI:Serumcholesterol:SI(mmol/L), 8.84e-07, -0.0355,-0.22915,- CHPSI:Serumcholesterol:SI(mmol/L) (ln+1)TCP:Total 3.69e-07 0.0355,-0.22915,- 2038, 2038, 2038, cholesterol (mg/dL) 1.47E-12 0.04416, 2038 3904 (ln+1)TCP:Total cholesterol (mg/dL) 1.47E-12 -0.04416, 3904 (ln+1)TCP:Cholesterol,Total(mg/dL) 1.47E-12 -0.04416, 3904 TCP:Total cholesterol (mg/dL) 6.10E-13 -9.05577, 3904 TCP:Total cholesterol (mg/dL) 6.10E-13 -9.05577, 3904 TCP:Cholesterol,Total(mg/dL) 6.10E-13 -9.05577, 3904 (ln+1)TCPSI:Total cholesterol (mmol/L) 1.06E-12 -0.03704, 3904 (ln+1)TCPSI:Total cholesterol (mmol/L) 1.06E-12 -0.03704, 3904 NHANES TCPSI:Total cholesterol (mmol/L) 6.10E-13 -0.23416, 3904 combined TCPSI:Total cholesterol (mmol/L) 6.10E-13 -0.23416 3904 0.000114189, -0.03004(0.00777), 0.000183569, -5.82612(1.55471), (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 2.37e-05, -0.02768(0.00653), cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 0.000114189, -0.03004(0.00777), (mmol/L), (ln + 1)TCP:Cholesterol, Total(mg/dL), 0.000183569, -5.82612(1.55471), TCP:Cholesterol, Total(mg/dL), TCPSI:Total cholesterol 4.62e-05, -0.16877(0.04134), (mmol/L), HAE7:Doctor told you - high cholesterol 0.0042, 0.69(0.1296), - 2073, 2073, 2077, level, (ln + 1)CHP:Serumcholesterol(mg/dL), 2.17e-05, 0.03325(0.00781), 2073, 2073, 2077, LDL cholesterol, CHP:Serumcholesterol(mg/dL), 4.62e-05, -6.52646(1.59847), 868, 2077, 2077, 19060906, Total CELSR2 rs12740374 Coronary heart NHB NHANES III HAE7:Doctortoldbloodcholesterollevelhigh 0.0042 0.69(0.1296) 868 21347282 Cholesterol disease (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 3.84e-05, -0.0316(0.00765), - LBDSCHSI:Cholesterol (mmol/L), (ln + 4.57e-05, 0.19104(0.04671), 1)LBDTCSI:Total cholesterol (mmol/L), 5.12e-05, -0.03089(0.0076), - LBDTCSI:Total cholesterol (mmol/L), (ln + 5.92e-05, 0.18823(0.04672), 1)LBXSCH:Serum Cholesterol (mg/dL), 3.72e-05, 
-0.0381(0.00921), - LBXSCH:Serum Cholesterol (mg/dL), (ln + 4.54e-05, 7.38939(1.80597), 1319, 1319, 1320, 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 4.99e-05, -0.03712(0.00912), 1320, 1319, 1319, cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 5.92e-05, -7.27938(1.80668), 1320, 1320, 1319, NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 3.72e-05, -0.0381(0.00921), - 1319

4.54e-05 7.38939(1.80597)

2.37e-05, (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 4.62e-05, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 1.86e-08, 1)TCP:Total cholesterol (mg/dL), TCP:Total cholesterol 3.98e-08, (mg/dL), (ln + 1)TCPSI:Total cholesterol (mmol/L), 1.97e-08, TCPSI:Total cholesterol (mmol/L), (ln + 3.98e-08, -0.02768, -0.16877, 1)CHPSI:Serumcholesterol:SI(mmol/L), 2.37e-05, -0.03355, -6.52407, 2077, 2077, 3393, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 4.62e-05, -0.02791, -0.16872, 3393, 3393, 3393, NHANES 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 1.86e-08, -0.02768, -0.16877, 2077, 2077, 3393, combined Total(mg/dL) 3.98e-08 -0.03355, -6.52407 3393 -0.03574(0.00753), 2.21e-06, -7.3677(1.54148), - (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 1.85e-06, 0.03574(0.00753), cholesterol (mg/dL), (ln + 1)TCP:Cholesterol, 2.21e-06, -7.3677(1.54148), - Total(mg/dL), TCP:Cholesterol, Total(mg/dL), (ln + 1.85e-06, 0.03166(0.00636), 1)TCPSI:Total cholesterol (mmol/L), TCPSI:Total 6.82e-07, -0.20554(0.04094), cholesterol (mmol/L), HAE7:Doctor told you - high 5.5e-07, 1e- 0.689(0.0915), - 2595, 2595, 2595, cholesterol level, (ln + 1)CHP:Serumcholesterol(mg/dL), 06, 7.14e-07, 0.0376(0.00757), - 2595, 2593, 2593, CHP:Serumcholesterol(mg/dL), 5.49e-07, 1e- 7.94839(1.58305), 1692, 2593, 2593, NHANES III HAE7:Doctortoldbloodcholesterollevelhigh 06 0.689(0.0915) 1692 9.39e-07, -0.02215(0.00451), (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 9.53e-07, -0.13963(0.02845), LBDSCHSI:Cholesterol (mmol/L), (ln + 2.85e-07, -0.02322(0.00452), 1)LBDTCSI:Total cholesterol (mmol/L), 3.01e-07, -0.14834(0.02891), LBDTCSI:Total cholesterol (mmol/L), (ln + 9.95e-07, -0.02638(0.00538), 1)LBXSCH:Serum Cholesterol (mg/dL), 9.39e-07, -5.40219(1.09982), LDL cholesterol, LBXSCH:Serum Cholesterol (mg/dL), (ln + 3.04e-07, -0.02757(0.00538), 3956, 3956, 3958, 19060906, Total CELSR2 rs12740374 Coronary heart NHW 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 3.04e-07, -5.73379(1.11778), 3958, 3956, 3956, 21347282 
Cholesterol disease cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 9.95e-07, -0.02638(0.00538), 3958, 3958, 3956, NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 9.39e-07 -5.40219(1.09982) 3956

(ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 6.82e-07, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 5.5e-07, 1)CHPSI:Serumcholesterol:SI(mmol/L), 6.82e-07, -0.03166,- CHPSI:Serumcholesterol:SI(mmol/L) TCP:Total 5.5e-07 0.20554,-0.03166,- 2593, 2593, 2593, cholesterol (mg/dL) 6.10E-12 0.20554,-0.03026, 2593 6553 TCP:Total cholesterol (mg/dL) 6.10E-12 -0.03026, 6553 TCP:Cholesterol,Total(mg/dL) 6.10E-12 -0.03026, 6553 TCP:Total cholesterol (mg/dL) 4.40E-12 -6.29047, 6553 TCP:Total cholesterol (mg/dL) 4.40E-12 -6.29047, 6553 TCP:Cholesterol,Total(mg/dL) 4.40E-12 -6.29047, 6553 TCPSI:Total cholesterol (mmol/L) 5.31E-12 -0.02545, 6553 TCPSI:Total cholesterol (mmol/L) 5.31E-12 -0.02545, 6553 NHANES TCPSI:Total cholesterol (mmol/L) 4.27E-12 -0.16274, 6553 combined TCPSI:Total cholesterol (mmol/L) 4.27E-12 -0.16274 6553 0.00035201, -0.02283(0.00638), (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 0.001385895, -4.18811(1.30835), cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 0.000306904, -0.01947(0.00539), Total (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 0.001115616, -0.11338(0.03474), DOCK7 rs1748195 Triglycerides 18193043 NHW Cholesterol 1)CHP:Serumcholesterol(mg/dL), 0.000238577, -0.02357(0.00641), CHP:Serumcholesterol(mg/dL), (ln + 0.001116754, -4.38398(1.34354), 2575, 2575, 2573, 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 0.00035201, -0.02283(0.00638), 2573, 2573, 2573, NHANES III Total(mg/dL) 0.001385895 -4.18811(1.30835) 2575, 2575

173 (ln + 1)LBDSCHSI:Cholesterol (mmol/L), (ln + 0.00859936, -0.01049(0.00399), 1)LBDTCSI:Total cholesterol (mmol/L), 0.00527186, -0.01114(0.00399), LBDTCSI:Total cholesterol (mmol/L), (ln + 0.00694105, -0.06909(0.02558), 1)LBXSCH:Serum Cholesterol (mg/dL), (ln + 0.00861798, -0.01251(0.00476), 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 0.00520943, -0.01328(0.00475), 3965, 3967, 3967, cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 0.00687426, -2.675(0.98917), - 3965, 3967, 3967, NHANES 9902 (mg/dL) 0.00861798 0.01251(0.00476) 3965 0.000306904, (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 0.00111562, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 6.73e-06, 1)TCP:Total cholesterol (mg/dL), TCP:Total cholesterol 3.04e-05, (mg/dL), (ln + 1)TCPSI:Total cholesterol (mmol/L), 8.21e-06, TCPSI:Total cholesterol (mmol/L), (ln + 3.03e-05, -0.01947, -0.11338, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.000306904, -0.01724, -3.30184, 2573, 2573, 6542, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00111562, -0.01433, -0.08541, 6542, 6542, 6542, NHANES 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 6.73e-06, -0.01947, -0.11338, 2573, 2573, 6542, combined Total(mg/dL) 3.04e-05 -0.01724, -3.30184 6542 0.000176921, 0.02762(0.00735), (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 6.45e-05, 6.02735(1.50596), cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 0.000150846, 0.02337(0.00616), (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 6.45e-05, 0.15587(0.03894), 1)CHP:Serumcholesterol(mg/dL), 0.000451418, 0.02597(0.00739), CHP:Serumcholesterol(mg/dL), (ln + 0.000215228, 5.73791(1.54837), 2553, 2553, 2553, 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 0.000176921, 0.02762(0.00735), 2553, 2551, 2551, NHANES III Total(mg/dL) 6.45e-05 6.02735(1.50596) 2553, 2553 0.00103924, 0.01524(0.00464), (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 0.00114912, 0.09545(0.02934), LBDSCHSI:Cholesterol (mmol/L), (ln + 0.00129991, 0.01496(0.00465), 1)LBDTCSI:Total cholesterol (mmol/L), 0.00135048, 
0.09553(0.02979), LBDTCSI:Total cholesterol (mmol/L), (ln + 0.00101053, 0.01823(0.00554), Total 1)LBXSCH:Serum Cholesterol (mg/dL), 0.0011311, 3.69558(1.13427), LIPC rs1800588 HDL cholesterol 18193044 NHW Cholesterol LBXSCH:Serum Cholesterol (mg/dL), (ln + 0.00133153, 0.01776(0.00553), 3942, 3942, 3944, 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 0.00136371, 3.69094(1.15179), 3944, 3942, 3942, cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 0.00101053, 0.01823(0.00554), 3944, 3944, 3942, NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 0.0011311 3.69558(1.13427) 3942 0.000401319, (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 0.000215002, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 1.91e-06, 1)TCP:Total cholesterol (mg/dL), TCP:Total cholesterol 7.36e-07, (mg/dL), (ln + 1)TCPSI:Total cholesterol (mmol/L), 1.6e-06, TCPSI:Total cholesterol (mmol/L), (ln + 7.28e-07, 0.02202, 0.14839, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.000401319, 0.02113, 4.5417, 2551, 2551, 6497, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.000215002, 0.01787, 0.1175, 6497, 6497, 6497, NHANES 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 1.91e-06, 0.02202, 0.14839, 2551, 2551, 6497, combined Total(mg/dL) 7.36e-07 0.02113, 4.5417 6497 2.92e-05, 0.07783(0.01859), (ln + 1)TGP:Triglyceride (mg/dL), TGP:Triglyceride 0.000521478, 13.41818(3.86256), (mg/dL), (ln + 1)TGPSI:Triglyceride (mmol/L), 8.52e-05, 0.04527(0.0115), TGPSI:Triglyceride (mmol/L), (ln + 0.000523806, 0.15144(0.04361), 1)TGP:Serumtriglycerides(mg/dL), 2.92e-05, 0.07783(0.01859), TGP:Serumtriglycerides(mg/dL), (ln + 0.000521478, 13.41818(3.86256), 2553, 2553, 2553, LIPC rs1800588 HDL cholesterol 18193044 Triglycerides NHW 1)TGPSI:Serumtriglycerides:SI(mmol/L), 8.52e-05, 0.04527(0.0115), 2553, 2553, 2553, NHANES III TGPSI:Serumtriglycerides:SI(mmol/L) 0.000523806 0.15144(0.04361) 2553, 2553 (ln + 1)LBDSTRSI:Triglycerides (mmol/L), (ln + 0.000679462, 0.03257(0.00958), 1)LBDTRSI:Triglyceride (mmol/L), (ln + 0.000616379, 
0.04611(0.01344), 1)LBXSTR:Triglycerides (mg/dL), (ln + 0.000188324, 0.05792(0.0155), 3942, 1944, 3942, NHANES 9902 1)LBXTR:Triglyceride (mg/dL) 0.000403743 0.07594(0.02143) 1944

1.35e-07, (ln + 1)TGP:Triglyceride (mg/dL), TGP:Triglyceride 0.00010001, (mg/dL), (ln + 1)TGPSI:Triglyceride (mmol/L), 4.96e-07, TGPSI:Triglyceride (mmol/L), (ln + 0.000100648, 1)TGP:Serumtriglycerides(mg/dL), 1.35e-07, 0.07417, 12.1929, TGP:Serumtriglycerides(mg/dL), (ln + 0.00010001, 0.044, 0.13761, 4497, 4497, 4497, NHANES 1)TGPSI:Serumtriglycerides:SI(mmol/L), 4.96e-07, 0.07417, 12.1929, 4497, 4497, 4497, combined TGPSI:Serumtriglycerides:SI(mmol/L) 0.000100648 0.044, 0.13761 4497, 4497

0.004593853, 0.02215(0.00781), (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 0.002448706, 4.85279(1.6002), cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 0.004180783, 0.01875(0.00654), (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 0.002457867, 0.12545(0.04138), 1)CHP:Serumcholesterol(mg/dL), 0.009399236, 0.02042(0.00786), CHP:Serumcholesterol(mg/dL), (ln + 0.006260315, 4.5053(1.64662), 2569, 2569, 2569, 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 0.004593853, 0.02215(0.00781), 2569, 2567, 2567, NHANES III Total(mg/dL) 0.002448706 4.85279(1.6002) 2569, 2569 1.15e-05, 0.02164(0.00493), (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 2.22e-05, 0.13212(0.03111), LBDSCHSI:Cholesterol (mmol/L), (ln + 1.13e-05, 0.02165(0.00493), 1)LBDTCSI:Total cholesterol (mmol/L), 1.84e-05, 0.13542(0.03157), APOB - 19060906, Total LBDTCSI:Total cholesterol (mmol/L), (ln + 1.09e-05, 0.02589(0.00588), rs515135 LDL cholesterol NHW KLHL29 20864672 Cholesterol 1)LBXSCH:Serum Cholesterol (mg/dL), 2.25e-05, 5.10482(1.20296), LBXSCH:Serum Cholesterol (mg/dL), (ln + 1.06e-05, 0.02584(0.00586), 3957, 3957, 3959, 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 1.81e-05, 5.24109(1.22086), 3959, 3957, 3957, cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 1.09e-05, 0.02589(0.00588), 3959, 3959, 3957, NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 2.25e-05 5.10482(1.20296) 3957 0.0087631, (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 0.00626111, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 1.67e-07, 1)TCP:Total cholesterol (mg/dL), TCP:Total cholesterol 1.39e-07, (mg/dL), (ln + 1)TCPSI:Total cholesterol (mmol/L), 1.59e-07, TCPSI:Total cholesterol (mmol/L), (ln + 1.42e-07, 0.01732, 0.11651, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.0087631, 0.02464, 5.12808, 2567, 2567, 6528, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00626111, 0.02071, 0.13253, 6528, 6528, 6528, NHANES 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 1.67e-07, 0.01732, 0.11651, 2567, 2567, 6528, combined 
Total(mg/dL) 1.39e-07 0.02464, 5.12808 6528 0.008196822, -0.02051(0.00775), (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 0.004537427, -4.51224(1.58845), cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 0.007513933, -0.01737(0.00649), (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 0.004548083, -0.11666(0.04108), 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 0.008196822, -0.02051(0.00775), 2577, 2577, 2577, NHANES III Total(mg/dL) 0.004537427 -4.51224(1.58845) 2577, 2577, 2577 1.95e-05, -0.021(0.00491), - (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 3.92e-05, 0.12776(0.03103), LBDSCHSI:Cholesterol (mmol/L), (ln + 1.83e-05, -0.02107(0.00491), 1)LBDTCSI:Total cholesterol (mmol/L), 3.26e-05, -0.131(0.03149), - APOB - 18262040, Total RS562338 LDL cholesterol NHW LBDTCSI:Total cholesterol (mmol/L), (ln + 1.79e-05, 0.02516(0.00586), KLHL29 18193043 Cholesterol 1)LBXSCH:Serum Cholesterol (mg/dL), 3.96e-05, -4.93668(1.19984), LBXSCH:Serum Cholesterol (mg/dL), (ln + 1.66e-05, -0.02519(0.00584), 3948, 3948, 3950, 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 3.2e-05, -5.07033(1.21787), 3950, 3948, 3948, cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 1.79e-05, -0.02516(0.00586), 3950, 3950, 3948, NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 3.96e-05 -4.93668(1.19984) 3948 4.83e-07, (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 4.58e-07, cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 4.76e-07, (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 4.66e-07, -0.02359, -4.8894, NHANES 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 4.83e-07, -0.01981, -0.12636, 6527, 6527, 6527, combined Total(mg/dL) 4.58e-07 -0.02359, -4.8894 6527, 6527, 6527

175 0.006229892, -0.02086(0.00762), (ln + 1)TCP:Total cholesterol (mg/dL), TCP:Total 0.005813327, -4.32606(1.56709), cholesterol (mg/dL), (ln + 1)TCPSI:Total cholesterol 0.005577447, -0.01786(0.00644), (mmol/L), TCPSI:Total cholesterol (mmol/L), (ln + 0.004983115, -0.11698(0.04162), 1)CHP:Serumcholesterol(mg/dL), 0.006019458, -0.02104(0.00765), CHP:Serumcholesterol(mg/dL), (ln + 0.004985988, -4.52309(1.60935), 2438, 2438, 2436, 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 0.006229892, -0.02086(0.00762), 2436, 2436, 2436, NHANES III Total(mg/dL) 0.005813327 -4.32606(1.56709) 2438, 2438 0.00576128, -0.01225(0.00443), (ln + 1)LBDSCHSI:Cholesterol (mmol/L), 0.00686923, -0.07566(0.02797), LBDSCHSI:Cholesterol (mmol/L), (ln + 0.00559637, -0.01231(0.00444), 1)LBDTCSI:Total cholesterol (mmol/L), 0.00688266, -0.07686(0.02843), C2orf43 - Total LBDTCSI:Total cholesterol (mmol/L), (ln + 0.00549281, -0.0147(0.00529), - rs7557067 Triglycerides 19060906 NHW 1)LBXSCH:Serum Cholesterol (mg/dL), 0.00690845, 2.92321(1.08161), APOB Cholesterol LBXSCH:Serum Cholesterol (mg/dL), (ln + 0.00529182, -0.01475(0.00528), 3964, 3964, 3966, 1)LBXTC:Total cholesterol (mg/dL), LBXTC:Total 0.00691472, -2.97045(1.09922), 3966, 3964, 3964, cholesterol (mg/dL), (ln + 1)LBXSCH:Cholesterol, total 0.00549281, -0.0147(0.00529), - 3966, 3966, 3964, NHANES 9902 (mg/dL), LBXSCH:Cholesterol, total (mg/dL) 0.00690845 2.92321(1.08161) 3964 0.00557745, (ln + 1)CHPSI:Serum cholesterol: SI (mmol/L), 0.00498312, CHPSI:Serum cholesterol: SI (mmol/L), (ln + 0.0001115, 1)TCP:Total cholesterol (mg/dL), TCP:Total cholesterol 0.000133719, (mg/dL), (ln + 1)TCPSI:Total cholesterol (mmol/L), 0.000111244, TCPSI:Total cholesterol (mmol/L), (ln + 0.000132141, -0.01786, -0.11698, 1)CHPSI:Serumcholesterol:SI(mmol/L), 0.00557745, -0.01689, -3.45552, 2436, 2436, 6404, CHPSI:Serumcholesterol:SI(mmol/L), (ln + 0.00498312, -0.01418, -0.08943, 6404, 6404, 6404, NHANES 1)TCP:Cholesterol, Total(mg/dL), TCP:Cholesterol, 0.0001115, 
-0.01786, -0.11698, 2436, 2436, 6404, combined Total(mg/dL) 0.000133719 -0.01689, -3.45552 6404

Appendix Table 2.4. Evaluation of differences in allele frequency for PheWAS-significant SNPs.

SNP         NHW_CAF  MA % Difference  NHB % Difference  MA % Difference  NHB % Difference
                     with NHW         with NHW          with NHW         with NHW
rs2237895   C        0.55             21.83             2.17             24.52
rs4355801   G        1.67             35.15             3.96             36.93
rs646776    G        0.87             14.24             3.04             10.64
rs515135    A        0.79             26.78             0.18             29.89
rs12740374  T        1.22             5.40              5.29             1.76
rs3764261   T        2.42             0.09              0.12             1.57
rs7557067   G        2.58             7.09              3.11             8.88
rs1529729   G        5.32             16.12             8.91             16.26
rs13266634  T        4.47             21.42             3.27             22.10
rs2338104   C        6.69             19.30             6.67             18.36
rs1748195   G        6.18             29.75             8.30             32.47
rs780094    A        8.32             24.03             6.66             22.06
rs1260326   T        8.75             27.33             7.85             25.88
rs673548    A        4.59             2.36              5.49             0.42
rs562338    T        4.88             39.47             3.83             42.00
rs328       G        3.13             2.94              4.39             2.81
rs12678919  G        3.33             2.01              3.49             1.28
rs10503669  A        3.30             3.66              4.41             2.58
rs11206510  C        6.56             4.53              8.04             4.25
rs6855911   G        12.30            23.14             14.56            26.86
rs4420638   G        8.49             3.37              7.10             1.37
rs7442295   G        13.17            16.71             13.34            18.02
rs2231142   A        7.03             8.48              8.97             7.31
rs1800795   C        24.01            32.38             21.36            31.83
rs174547    C        26.98            25.13             24.35            23.27
rs964184    G        14.66            7.77              16.13            5.68
rs28927680  C        7.31             10.07             7.56             8.63
rs12286037  T        7.53             13.57             7.10             10.27
rs3135506   C        7.40             0.05              7.00             0.04
rs1800588   T        30.42            28.30             29.62            28.19

MA = Mexican Americans, NHB = Non-Hispanic blacks, NHW = Non-Hispanic whites
CAF = coded allele frequency
MA % Difference with NHW = absolute value (CAF MA - CAF NHW) * 100
NHB % Difference with NHW = absolute value (CAF NHB - CAF NHW) * 100
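The percent-difference columns of Appendix Table 2.4 compare each group's coded allele frequency (CAF) with the NHW frequency. A minimal Python sketch of that arithmetic follows; the function name and the example frequencies are hypothetical and chosen only for illustration, they are not values from the table.

```python
def caf_percent_difference(caf_group: float, caf_reference: float) -> float:
    """Absolute difference between two coded allele frequencies, times 100."""
    for caf in (caf_group, caf_reference):
        if not 0.0 <= caf <= 1.0:
            raise ValueError("coded allele frequency must lie in [0, 1]")
    return abs(caf_group - caf_reference) * 100.0


# Hypothetical frequencies for one SNP (illustration only, not study data).
caf_nhw = 0.25  # reference group: Non-Hispanic whites
caf_ma = 0.30   # Mexican Americans
caf_nhb = 0.50  # Non-Hispanic blacks

ma_diff = round(caf_percent_difference(caf_ma, caf_nhw), 2)
nhb_diff = round(caf_percent_difference(caf_nhb, caf_nhw), 2)
print(ma_diff, nhb_diff)
```

Taking the absolute value makes the measure symmetric, so it reports how far a group's frequency sits from the NHW frequency without indicating direction.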

Appendix Table 4.1. All 83 replicating SNP-SNP models. Nearest genes and LRT p-value of discovery (DIS.LRT_Pval) and replication (REP.LRT_Pval) samples included.

SNP1_SNP2_ID           Gene1_Gene2       DIS.LRT_Pval  REP.LRT_Pval
rs2303436_rs9811074    DLAT_PDHB         0.000289527   0.0131365
rs735563_rs7644104     CSF2RB_CNTN4      0.000348106   0.04256
rs5756391_rs7644104    CSF2RB_CNTN4      0.000364521   0.0403438
rs9320004_rs527459     KIAA1468_PIGO     0.000399471   0.00367364
rs10789856_rs9811074   DIXDC1_PDHB       0.000426559   0.0178522
rs479658_rs9811074     DIXDC1_PDHB       0.000816167   0.0115644
rs3796947_rs6954351    EGF_EGFR          0.000817242   0.0251763
rs16833215_rs1905339   STAT4_PTRF        0.00085071    0.0232738
rs478941_rs9811074     DLAT_PDHB         0.000872142   0.0101332
rs627441_rs9811074     DLAT_PDHB         0.000872142   0.0135848
rs2250724_rs6954351    EGF_EGFR          0.000883585   0.0196154
rs10891303_rs9811074   DIXDC1_PDHB       0.000992213   0.0133633
rs3809033_rs9811074    DIXDC1_PDHB       0.000992867   0.0133633
rs9320004_rs634801     KIAA1468_PIGO     0.00102872    0.0116298
rs10891314_rs9811074   DLAT_PDHB         0.00104211    0.0106784
rs2075726_rs6776572    CSF2RB_CNTN4      0.00137447    0.035542
rs1366758_rs1414316    RHBDD1_IRS2       0.00150219    0.0165024
rs10923918_rs9267947   ADAM30_NOTCH4     0.00183126    0.0499972
rs1366758_rs12584136   RHBDD1_IRS2       0.0020021     0.0298185
rs1366758_rs9521510    RHBDD1_IRS2       0.00207091    0.00798813
rs11069721_rs8178108   LIG4_PRKDC        0.00216535    0.0101235
rs10789859_rs9811074   SDHD_PDHB         0.00260174    0.00840048
rs7242402_rs839721     BCL2_DNAH2        0.00282227    0.0122072
rs2178095_rs11899004   PRIMPOL_CASP8     0.00288793    0.016507
rs1011173_rs6037336    ACSBG1_EBF4       0.00312601    0.000386837
rs5027651_rs17169361   SMPDL3B_RPA3      0.00319329    0.0200536
rs17518446_rs3808755   EGFR_SH3GL2       0.00321173    0.0330731
rs3088378_rs9909163    RAD52_RPA1        0.00322942    0.0317027
rs1220825_rs8109050    CSN1S1_SULT2B1    0.00323375    0.0402104
rs11238349_rs12950752  EGFR_GRB2         0.00330117    0.0231303
rs4815270_rs11090327   ZNF343_UPB1       0.00359606    0.0427941
rs396746_rs4802533     LINC00336_RUVBL2  0.00378938    0.00341093
rs11780874_rs8110090   CYC1_TGFB1        0.00384994    0.0363343
rs12203582_rs133431    IL17F_MCM5        0.0038712     0.00217286
rs3124599_rs3132946    NOTCH1_NOTCH4     0.00414988    0.0382099
rs761167_rs133431      SLC25A20P1_MCM5   0.00415547    0.00152728
rs12584299_rs8178108   LIG4_PRKDC        0.00459895    0.0493178
rs2172159_rs8109050    CSN1S1_SULT2B1    0.00461696    0.0467234
rs4148215_rs12465555   ABCG8_LRPPRC      0.00471001    0.0158619
rs7004186_rs8110090    EXOSC4_TGFB1      0.00475148    0.0491674
rs6734836_rs9892996    ERBB4_GRB2        0.00484468    0.0165875
rs8014403_rs3025035    EIF2B2_VEGFA      0.00489991    0.0366007
rs10948691_rs133431    IL17F_MCM5        0.0049052     0.00221317
rs4148216_rs12465555   ABCG8_LRPPRC      0.00490917    0.0157697
rs191827_rs880375      PIGK_FOXN1        0.00519984    0.0127948
rs1366758_rs7999797    RHBDD1_IRS2       0.00521813    0.00290536
rs2458927_rs17590006   PRR5L_DDAH1       0.00526132    0.0336682
rs10744729_rs11651877  RAD52_RTN4RL1     0.00529914    0.0106755
rs5027651_rs4720750    SMPDL3B_RPA3      0.00536182    0.035481
rs2702185_rs2072710    SMG7_NCF4         0.00543407    0.0418295
rs133422_rs12472293    MCM5_MCM6         0.00559669    0.0330825
rs5030755_rs17137415   RPA1_RPA3OS       0.00579949    0.0232215
rs7300444_rs11651877   WNK1_RTN4RL1      0.00588652    0.0107771
rs17382610_rs17834873  C1orf52_CARD11    0.00588732    0.0166777
rs2740764_rs2208485    EGFR_SH3GL2       0.0059177     0.0488183
rs11152337_rs634801    KIAA1468_PIGO     0.00596114    0.0376816
rs7325027_rs17548270   LIG4_TMEM169      0.00636882    0.00109526
rs10499186_rs2331902   STX7_STX6         0.00639479    0.0251586
rs2838552_rs3740607    C21orf2_PITRM1    0.00661637    0.0227595
rs1366758_rs9521509    RHBDD1_IRS2       0.00676537    0.00139223
rs1505359_rs9892996    ERBB4_GRB2        0.00688965    0.00279412
rs1394781_rs9892996    ERBB4_GRB2        0.00703155    0.00804606
rs948413_rs4947978     C1QTNF5_EGFR      0.00703442    0.0446713
rs2843027_rs405875     NOTCH2P1_NOTCH4   0.00714527    0.00658144
rs10951757_rs6592577   RNA5SP230_POLD3   0.00732436    0.0265863
rs8103483_rs11069806   INSR_IRS2         0.00732635    0.0140684
rs12597188_rs11564445  CDH1_CTNNB1       0.00762653    0.000535081
rs867185_rs2706399     NBN_IL5           0.00785855    0.0420897
rs1047417_rs2877260    CBL_EGFR          0.0079112     0.0411596
rs11133360_rs4714699   KDR_VEGFA         0.00792739    0.00352063
rs133422_rs730005      MCM5_LCT          0.0079653     0.0294442
rs1800454_rs241430     TAP2_TAP2         0.00800121    0.0462247
rs11152337_rs527459    KIAA1468_PIGO     0.00817881    0.00836045
rs11670375_rs6538683   EML2_CCDC38       0.0082539     0.0393634
rs3759001_rs10892234   MPZL2_CD3E        0.00840048    0.00373131
rs9948708_rs527459     KIAA1468_PIGO     0.00855438    0.0229877
rs4333645_rs2025072    TMEM249_CPSF2     0.0086389     0.000466599
rs2843027_rs3130315    NOTCH2P1_NOTCH4   0.00878407    0.0065458
rs2843027_rs3115573    NOTCH2P1_NOTCH4   0.00895007    0.00651655
rs16833215_rs744166    STAT4_STAT3       0.00911554    0.0251692
rs210139_rs6060599     BAK1_BCL2L1       0.00917683    0.00505784
rs882019_rs10899035    MYL7_CHRDL2       0.0095287     0.0450023
rs2115386_rs9808105    INSR_MIR5702      0.0096892     0.0281701
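The p-values in Appendix Table 4.1 come from likelihood ratio tests (LRT). As a rough sketch of how an LRT p-value is generally obtained from two nested models, the Python below compares a reduced model with a full model that adds one term. The single degree of freedom, the function name, and the log-likelihood values are all assumptions made for illustration; the table itself does not state the model specification or degrees of freedom used.

```python
import math


def lrt_pvalue_1df(loglik_reduced: float, loglik_full: float) -> float:
    """P-value of a likelihood ratio test with 1 degree of freedom (assumed).

    The statistic 2 * (loglik_full - loglik_reduced) is referred to a
    chi-square distribution with 1 df, whose upper tail probability at x
    equals erfc(sqrt(x / 2)).
    """
    statistic = 2.0 * (loglik_full - loglik_reduced)
    if statistic < 0.0:
        raise ValueError("full model cannot fit worse than the nested reduced model")
    return math.erfc(math.sqrt(statistic / 2.0))


# Hypothetical log-likelihoods (illustration only, not study results).
loglik_main_effects = -1200.0     # reduced model: main effects only
loglik_with_interaction = -1195.0  # full model: adds one SNP-SNP interaction term

p_value = lrt_pvalue_1df(loglik_main_effects, loglik_with_interaction)
print(p_value)
```

A test with more than one added parameter would need the chi-square tail for the matching degrees of freedom, which this closed form does not cover.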

Appendix Figure 5.1. Impact of MAF on alpha value.

Appendix Figure 5.2. Expected null distribution of the alpha value by minor allele frequency.

Appendix Figure 5.3. Power plots at 10% and 50% minor allele frequency.

VITA

EDUCATION

PhD 2015 Pennsylvania State University (Biochemistry & Molecular Biology)

MS 2011 Columbia University (Neuroscience & Education)

BS 2005 Cornell University (Human Development)

PROFESSIONAL EMPLOYMENT

Research Positions

2011-Present Graduate Research Assistant – Ritchie Laboratory

Department of Biochemistry & Molecular Biology

Pennsylvania State University

2009-2010 Research Assistant – Venkatesh Laboratory

Center for Human Genetics Research

City College

2009-2010 Research Assistant – Gordon Laboratory

Center for Cerebral Palsy Research

Columbia University

2004-2005 Research Assistant – Temple Laboratory

Department of Human Development

Cornell University

2002-2004 Research Assistant – Interactive Media Laboratory

Dartmouth-Hitchcock Medical Center

Dartmouth College