ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND DISEASE-RELATED PHENOTYPES

A DISSERTATION SUBMITTED TO THE PROGRAM IN BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Chirag Jagdish Patel August 2011

© 2011 by Chirag Jagdish Patel. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/mg775gw7130

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Atul Butte, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Jayanta Bhattacharya

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.

Mark Cullen

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

ABSTRACT

Common diseases arise out of a combination of both genetic and environmental influences. Advances in genomic technology have enabled investigators to create hypotheses regarding the contribution of genetic factors at a breathtaking pace. However, the assessment of multiple and specific environmental factors, and of their interactions with the genome, has not kept pace. We lack high-throughput analytic methodologies to comprehensively and systematically associate multiple physical and specific environmental factors, or the “envirome”, to disease and health.

We claim that the creation of hypotheses regarding the environmental contribution to disease is practicable through high-throughput analytic methods that have been well established in genomics. In the following dissertation, we develop and apply methods to systematically and comprehensively associate specific factors of the envirome with disease states, prioritizing factors for in-depth future study.

The disciplines that currently study the environmental determinants of health include toxicology and epidemiology, which operate on molecular and population scales, respectively. This dissertation proposes approaches in both of these disciplines. For example, we have developed a framework to conduct the first “Environment-wide Association Study” (EWAS), systematically associating environmental factors to disease on a population scale. We have applied this framework to investigate type 2 diabetes and heart disease in cohorts that are representative of the United States population, finding novel and robust associations in diverse and independent cohorts. Given the lack of explained risk resulting from current-day genome-wide studies, the time is ripe to usher in a more comprehensive study of the environment, or “enviromics”, toward better understanding of multifactorial diseases and their prevention.


ACKNOWLEDGEMENTS

Foremost, I thank my advisor, Dr. Atul Butte, for his undying confidence, inspiration, and guidance. Even just three years ago, it was far from my belief that the scientist whom I admired from afar would eventually take me on as a student and teach me how to compute, see, and enlighten. For Dr. Atul Butte’s supervision I am forever indebted and most fortunate.

I am also indebted to my dissertation committee, Drs. Jay Bhattacharya, Mark Cullen, John Ioannidis, and Robert Tibshirani. Much of this work has come out of discussions with these individuals and it is inspired by and stands on their fundamental teachings. I thank my academic advisors, Drs. Mark Musen and Betty Cheng, for encouraging me to keep taking courses that enabled this work.

I thank my many friends and colleagues in the Butte Laboratory and in the Biomedical Informatics program whom I continue to look up to and draw inspiration from. I feel honored and privileged to be among you. In particular, I thank Dr. Rong Chen, Alex Morgan, Joel Dudley, and Nick Tatonetti for providing support and encouragement when it was least expected but most needed.

From teaching me how to read and write to gifting me the newest computers, I thank my parents, Neela and Jagdish Patel. I will always be grateful to them for initiating this most rewarding journey of lifelong learning.

I thank my brother, Ankur Patel, for his unflagging support and faith through thick and thin.

I thank my in-laws, Tapan and Kokila Chaudhuri, for their support and encouragement.

I do not have the words to thank my partner in life, Trina Chaudhuri. I hope that I can some day enable her to achieve her aspirations as she has done for me.

I am grateful to the National Library of Medicine and Applied Biosystems, Inc. for financial support. I thank the Centers for Disease Control and Prevention (CDC), the National Center for Health Statistics (NCHS), and the staff and individuals who take part in the National Health and Nutrition Examination Survey (NHANES). In particular, I thank Vijay Gambhir and Peter Meyer of the CDC/NCHS for their support in accessing and processing NHANES restricted genetic data. I am grateful again to Dr. Atul Butte for providing funds to access the NHANES restricted data. I thank the staff of the Biomedical Informatics Training program and the Butte Laboratory, Mary Jeanne Oliva, Susan Aptekar, Alex Skrenchuk, Dr. Russ Altman, and Dr. Larry Fagan. Without the support of these institutions and people, this work would not have been possible.

A portion of the work in this dissertation derives from two published articles and two articles currently in review for publication:

Chapter 2:
1. Patel, C.J. and A.J. Butte, Predicting environmental chemical factors associated with disease-related expression data. BMC Med Genomics, 2010. 3(1): p. 17.

Chapter 4:
2. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 2010. 5(5): p. e10746.
3. Patel, C.J., M.R. Cullen, J.P.A. Ioannidis, and A.J. Butte, Non-genetic associations and correlation globes for determinants of lipid levels: an environment-wide association study. Submitted, 7/2011.

Chapter 5:
4. Patel, C.J., R. Chen, J.P.A. Ioannidis, and A.J. Butte, Systematic identification of interaction effects between validated genome- and environment-wide associations on Type 2 Diabetes Mellitus. Submitted, 8/2011.

In the Chapter 2 work, I devised the methodology and wrote the manuscript with my advisor, Atul Butte. In the Chapter 4 work, I devised the “Environment-wide Association Study” (EWAS) framework and carried out the analyses. For the EWAS on Type 2 Diabetes, I wrote the manuscript with Jay Bhattacharya and Atul Butte. For the EWAS on serum lipid levels, I wrote and edited the manuscript with Mark Cullen, John Ioannidis, and Atul Butte. Finally, in the Chapter 5 work, I devised the “Gene-Environment-Wide Association Study” (G-EWAS) framework and implemented the software to carry out the analyses. Rong Chen and Atul Butte provided the database of curated genetic information. I interpreted the data and wrote the manuscript with Rong Chen, John Ioannidis, and Atul Butte.


TABLE OF CONTENTS

CHAPTER 1: INTRODUCING MULTI-DIMENSIONAL AND DATA-DRIVEN APPROACHES TO CREATE HYPOTHESES REGARDING ENVIRONMENTAL ASSOCIATIONS TO DISEASE
    What is the “Environment”? What is the “Envirome”?
    Creation of robust hypotheses connecting the environment, genome, and multifactorial disease
    Creating hypotheses comprehensively on a population scale
    Creating hypotheses comprehensively on a molecular or toxicological scale
    Discussion

CHAPTER 2. MAPPING MULTIPLE TOXICOLOGICAL RESPONSES TO COMPLEX DISEASE
    INTRODUCTION
    METHOD TO PREDICT ENVIRONMENTAL ASSOCIATION TO GENE EXPRESSION RESPONSE
    RESULTS
        Verification Phase
        Predicting Environmental Chemicals Associated with Cancer Data Sets
        Clustering Significant Predictions by PubChem-derived Biological Activity
    DISCUSSION

CHAPTER 3. METHODS TO EXECUTE ENVIRONMENT-WIDE ASSOCIATIONS ON DISEASE AND DISEASE-RELATED PHENOTYPES ON POPULATIONS
    INTRODUCTION
    METHODS BACKGROUND
        Genome-wide association to disease
        Environment-wide association to disease
        Genetic versus non-genetic associations in population scaled studies
    EWAS METHOD
        Stage 1: Linear Modeling
        Stage 2: Controlling for Multiple Hypotheses by Estimating the False Discovery Rate
        Stage 3: Validation
        Stage 4: Sensitivity Analyses
        Stage 5: Correlation Globes
    DISCUSSION

CHAPTER 4: ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND ADVERSE PHENOTYPES: APPLICATIONS TO TYPE 2 DIABETES (T2D) AND SERUM LIPID LEVELS
    INTRODUCTION
    ENVIRONMENT-WIDE ASSOCIATION STUDY ON TYPE 2 DIABETES
        EWAS on T2D: Methods
        EWAS on T2D: Results
        EWAS on T2D: Conclusions
    ENVIRONMENT-WIDE ASSOCIATION STUDY ON SERUM LIPID LEVELS
        EWAS on Serum Lipids: Methods
        EWAS on Serum Lipids: Results
        EWAS on Serum Lipids: Conclusions
    DISCUSSION

CHAPTER 5: TOWARD ENVIROME-GENOME INTERACTIONS IN THE CONTEXT OF HUMAN HEALTH: COMPREHENSIVELY SCREENING FOR GENE-ENVIRONMENT INTERACTIONS IN ASSOCIATION TO TYPE 2 DIABETES
    INTRODUCTION
        Background
        Screening for Gene-Environment Interactions: “G-EWAS”
    METHODS
        Data and selected genetic and environmental factors
        Regression Analyses
        Multiplicity Correction and FDR
    RESULTS
        Allele frequencies
        Power Calculations
        Marginal Associations
        Correlation between genetic variants with environmental variables
        Screening for Genetic Variant by Environment Interactions
        Sensitivity Analyses limited to non-Hispanic white and other Hispanic participants and older individuals
        Limited Evidence to Support Interactions with Body Mass Index
    DISCUSSION

CHAPTER 6: CONCLUSION AND DISCUSSION

REFERENCES

LIST OF TABLES

Table 1. Tentative categories of environmental factors as collected from MEDLINE MeSH terms.
Table 2. Gene expression dataset summary for verification stage.
Table 3. Chemical Prediction Results from the Verification Phase.
Table 4. Prediction of environmental chemicals associated with prostate cancer samples (GSE6919).
Table 5. Prediction of environmental chemicals associated with lung cancer samples (GSE10072).
Table 6. Prediction of environmental chemicals associated with breast cancer samples (GSE6883).
Table 7. Highly statistically significant environmental factors associated to T2D found in more than one NHANES cohort.
Table 8. Marginal association of each (n=18) or environmental factor (n=5) to T2D (FBG > 125 mg/dL).

LIST OF FIGURES

Figure 1. Number of publications investigating genetics (red) or the environment (black) in MEDLINE from 1971 onward.
Figure 2. Environmental factors investigated in context of disease sourced from MEDLINE.
Figure 3. Envirome-disease network for WHO priority diseases.
Figure 4. “Zoomed-in” Envirome-disease network for WHO priority diseases.
Figure 5. Overview of population- and molecular-scale methods to create hypotheses across the envirome and genome.
Figure 6. Creation of the chemical-gene signatures based on the Comparative Toxicogenomics Database (CTD).
Figure 7. Creation of the ‘Envirome Map’ using CTD chemical-gene signatures.
Figure 8. Predicting environmental chemical association to gene expression datasets.
Figure 9. Clustering chemical prediction lists by biological activity archived in PubChem.
Figure 10. Curated disease-chemical enrichment versus prediction lists for prostate cancer datasets.
Figure 11. Curated disease-chemical enrichment versus prediction lists for lung cancer datasets.
Figure 12. Curated disease-chemical enrichment versus prediction lists for breast cancer datasets.
Figure 13. Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity.
Figure 14. Sample data structure for EWAS.
Figure 15. Method summary for EWAS on NHANES data.
Figure 16. “Manhattan plot” style graphic showing the environment-wide association with T2D.
Figure 17. “Manhattan plot” style graphic showing the environment-wide associations to lipid levels.
Figure 18. Forest plot for top 12 validated environmental factors per cohort associated with triglycerides in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI.
Figure 19. Forest plot for validated environmental factors associated with LDL-C.
Figure 20. Forest plot for top 12 validated environmental factors associated with HDL-C.
Figure 21. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(triglycerides).
Figure 22. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(LDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; TVC: total vitamin C; TCRYP: total cryptoxanthin; count: total supplement use in past 30 days; cardiovascular: on lipid lowering drug or had heart disease.
Figure 23. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(HDL-C).
Figure 24. Pair-wise correlation globes for validated environmental and risk factors associated with triglycerides.
Figure 25. Pair-wise correlation globes for validated environmental and risk factors associated with LDL-C.
Figure 26. Pair-wise correlation globes for validated environmental and risk factors associated with HDL-C.
Figure 27. Schematic for comprehensive testing and screening for gene-environment interactions against T2D.
Figure 28. Power estimation for detection of interaction for each genetic locus and environmental factor pair tested against T2D (FBG > 125 mg/dL).
Figure 29. Manhattan plot of significance values of interaction term (-log10(p-value) for interaction term of pair of factors).
Figure 30. Per-risk allele effect sizes for top putative interactions with p < 0.05.

CHAPTER 1: INTRODUCING MULTI-DIMENSIONAL AND DATA-DRIVEN APPROACHES TO CREATE HYPOTHESES REGARDING ENVIRONMENTAL ASSOCIATIONS TO DISEASE

Environmental factors (“non-genetic” and often modifiable attributes such as diet, drugs, chemical pollutants, ecological processes, and infectious agents) contribute to disease and health in addition to genetic factors [1, 2]. For decades, epidemiologists and toxicologists have formally studied how much disease risk both environmental and genetic factors impart on the population [3, 4].

For example, epidemiologists have measured the attributable fraction, or the fraction of disease that would be eliminated if the factor were to be eliminated. For some complex diseases, as much as 70-90% of risk can be attributed to differing “environments” [4, 5], defined here as the cumulative effect of specific factors. Genetic factors, on the other hand, may also play a large role; for example, the heritability of obesity is estimated to range from 40-70% [6]. To determine what specific genetic or environmental factors contribute to disease, epidemiologists begin with an associative study in which variation for a factor is compared to disease status; for example, the presence of an environmental factor is compared in diseased versus undiseased individuals.
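The attributable fraction described above can be illustrated with a minimal sketch. It assumes Levin's standard formula for the population attributable fraction, which takes the prevalence of exposure and the relative risk as inputs; the function name and the numbers used are hypothetical and do not come from this dissertation's analyses.

```python
def attributable_fraction(prevalence_exposed: float, relative_risk: float) -> float:
    """Population attributable fraction via Levin's formula.

    prevalence_exposed: proportion of the population exposed to the factor (0..1).
    relative_risk: disease risk in the exposed divided by risk in the unexposed.
    """
    if not 0.0 <= prevalence_exposed <= 1.0:
        raise ValueError("prevalence_exposed must lie between 0 and 1")
    if relative_risk < 0.0:
        raise ValueError("relative_risk must be non-negative")
    # Excess disease burden contributed by the exposed part of the population.
    excess = prevalence_exposed * (relative_risk - 1.0)
    return excess / (1.0 + excess)


# Hypothetical inputs: half the population exposed, exposure triples the risk.
example_fraction = attributable_fraction(0.5, 3.0)
```

With these illustrative inputs, half of the disease burden would be attributed to the factor; a relative risk of 1 yields a fraction of 0 regardless of how common the exposure is.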

Although both types of factors contribute to disease and health, many investigations since the 1990s have focused on genetic factors (Figure 1). Recently, genetic association studies have advanced through a framework known as the “Genome-wide Association Study”, or GWAS; the Wellcome Trust Case Control Consortium study of 7 common diseases [7] is a notable example. In GWAS, 100,000 to 1 million genetic factors are compared in frequency between diseased and non-diseased populations [8], and to date there have already been over 350 of these studies [9]. In contrast, environmental epidemiological studies currently investigate at most a handful of candidate factors at a time in association to disease or phenotype. Relatedly, we lack methods to comprehensively and systematically report environmental associations to disease [10]. Environmental epidemiological studies are neither data-driven nor multi-dimensional and therefore do not allow for discovery in the way that genome-wide studies do.

While epidemiology attempts to ascertain the contribution of factors on a population scale, toxicology uses model biological systems to study the influence of specific physical factors through assessment of molecular responses [3, 11]. While technology exists to measure these molecular responses on a genome-wide dimension [12], we lack methods to comprehensively connect these with human disease state. In fact, the National Academy of Sciences has called for a molecular-based effort to comprehensively map toxicological findings to health and risk-associated phenotypes [13-15].

Our claim is that creation of hypotheses regarding the environmental associations to disease is possible through data-driven analytical methods that are standard in multi-dimensional genome-wide studies. To this end, we propose and implement (1) an integrative approach to connect molecular and toxicological response to human disease states (Chapter 2); (2) a population-based study framework to correlate multiple environmental factors to disease, called an “Environment-wide Association Study” (EWAS) (Chapters 3 and 4); and (3) a method to predict how environmental factors interact with genetic variants through unbiased integration of EWAS and GWAS (Chapter 5). First, we introduce the concept of the “envirome”, which assumes a large set of specific and possible environmental factors.

[Figure 1 plot: total publications (y-axis) by year published (x-axis), with one series for genetics and one for environment]

Figure 1. Number of publications investigating genetics (red) or the environment (black) in MEDLINE from 1971 onward. We queried MEDLINE for all articles that investigated World Health Organization priority diseases (Cardiovascular Disease, Coronary Disease, Hypertension, Type 2 Diabetes, Lung Cancer, Breast Cancer, Colon Cancer, Asthma, Chronic Obstructive Pulmonary Disease, Premature Birth, Kidney Disease, and Alzheimer Disease) and tabulated those strictly studying either genetics or the environment. Only articles investigating either disease and genetics or disease and environment as a primary subject matter were considered. Since the 1990s, genetic factors have been studied more than environmental factors for WHO priority diseases.

What is the “Environment”? What is the “Envirome”?

The “environment” is a loosely defined, heterogeneous mixture of non-genetic factors. For example, specific physical environmental factors include infectious agents (bacteria, viruses, fungi), dietary components (nutrients and vitamins), ultraviolet radiation, and non-nutrient chemicals (such as drugs or cigarette smoke). They may be by-products of man-made processes, such as air pollutants, or of natural processes, such as toxins from animals and plants. Other types of environmental factors are non-“physical” (not associated with one concrete factor) and are intertwined with behavior and lifestyle, such as stress and exercise. The environment also includes factors that arise as a result of interaction between internal biological processes and other environmental factors, such as metabolites of environmental chemicals or infectious agents [16]. Air pollution, lead, tobacco smoke, ultraviolet radiation, occupational risk factors, and climate, in addition to infectious agents, are example factors prioritized by the World Health Organization for constant monitoring [17].

Environmental factors are unique in their routes of exposure, modes of measurement, and dynamics. This heterogeneous mixture contrasts with the genetic variants assessed in GWAS, which are discrete and static units. The homogeneous nature of genetic factors has helped enable standardization of measurement (e.g., polymerase chain reaction [18] and gene “chips” [19]) and organization (through efforts by the National Center for Biotechnology Information, NCBI [20]).

Despite these characteristics, we propose a concept that allows for the multi-dimensional study of the environment: the “envirome”, the total “ensemble of the environment” [21] that can influence biological processes such as disease. While the “environment” refers to non-genetic factors in an abstract sense, we refer to the “envirome” in a way analogous to the genome, as a collection of specific and varying factors.

To get a better grasp of the types of factors, and the specific factors, investigated in the context of disease, we searched all of the scientific and medical literature in MEDLINE up to the year 2009 for evidence of investigation between an environmental factor and a disease condition. This search is made possible by Medical Subject Headings (MeSH), an annotation system administered by the National Library of Medicine (NLM) and applied to all articles in MEDLINE [22, 23]. These subject headings contain categories of terms, such as diseases (or conditions), chemicals, drugs, and physiological attributes. These sets of terms also have “qualifiers”, which contextualize relationships between terms; sample qualifiers include “etiology” (indicating an etiological relationship between a term corresponding to a condition and an environmental factor), “epidemiology” (indicating a population-based study between a condition and factor), and “drug therapy” (indicating a therapeutic relationship studied between the condition and factor), among others.

Specifically, we went through the MeSH annotations for each article and looked for terms denoting environmental factors and diseases, qualified by an etiological or epidemiological relationship. For example, an article examining the effects of smoking on incident type 2 diabetes (e.g., [24]) is annotated with the condition “Diabetes Mellitus, Type 2” and the specific environmental factor “Smoking”, qualified by “etiology”. Because many different aspects of factors and conditions can be reported for a given article, we looked specifically for those factors and conditions that MEDLINE annotators deemed the “major”, or main, subjects of the article. From this scraping of MEDLINE, we obtained a comprehensive list of disease-environmental factor pairs investigated in the medical literature. Furthermore, we also obtained the number of scientific publications that investigated each factor-disease pair. This indicated the degree or intensity to which a particular disease and factor has been studied.
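The pair-extraction step described above can be sketched as follows. This is a hedged illustration rather than the code used for this work: the record layout (the keys "headings", "term", "kind", "major", and "qualifiers") is hypothetical, and the sketch assumes that the etiological or epidemiological qualifier is attached to the disease heading.

```python
from collections import Counter

# Qualifiers that signal an etiological or epidemiological relationship.
RELEVANT_QUALIFIERS = {"etiology", "epidemiology"}


def count_factor_disease_pairs(articles):
    """Count publications per (factor, disease) pair.

    Only headings flagged as major subjects are used, and a disease heading
    must carry at least one relevant qualifier (an assumption of this sketch).
    """
    counts = Counter()
    for article in articles:
        headings = [h for h in article["headings"] if h["major"]]
        factors = {h["term"] for h in headings if h["kind"] == "factor"}
        diseases = {
            h["term"]
            for h in headings
            if h["kind"] == "disease" and RELEVANT_QUALIFIERS & set(h["qualifiers"])
        }
        for factor in factors:
            for disease in diseases:
                counts[(factor, disease)] += 1  # one count per article
    return counts


# Hypothetical annotated articles, for illustration only.
sample_articles = [
    {"headings": [
        {"term": "Smoking", "kind": "factor", "major": True, "qualifiers": []},
        {"term": "Diabetes Mellitus, Type 2", "kind": "disease", "major": True,
         "qualifiers": ["etiology"]},
    ]},
    {"headings": [
        {"term": "Smoking", "kind": "factor", "major": True, "qualifiers": []},
        {"term": "Lead", "kind": "factor", "major": False, "qualifiers": []},
        {"term": "Diabetes Mellitus, Type 2", "kind": "disease", "major": True,
         "qualifiers": ["epidemiology"]},
    ]},
    {"headings": [
        {"term": "Smoking", "kind": "factor", "major": True, "qualifiers": []},
        {"term": "Diabetes Mellitus, Type 2", "kind": "disease", "major": True,
         "qualifiers": ["drug therapy"]},
    ]},
]

sample_counts = count_factor_disease_pairs(sample_articles)
```

In this toy input, the third article is skipped because its disease heading is qualified only by “drug therapy”, and “Lead” is skipped because it is not a major subject.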

We attempt to assemble the envirome into coherent categories of environmental factors using the MeSH annotations. Specifically, each factor in MeSH is described by a set of terms. For example, “Smoking” is described by the term “Individual Behavior” and “Lead” by the terms “Hazardous Substance” and “Element”. We manually assigned each factor to one of 21 categories based on these descriptors (Table 1). There is notable overlap between factor categories involving drugs and chemicals; for these factors, we opted to categorize all factors that are drugs as “drug” and non-drug environmental chemicals (such as pesticides or materials) in chemically oriented bins (“organic chemical”, “inorganic chemical”, “element”).

The most highly investigated factor categories are drugs and medical procedures, comprising 24% and 29% of all factor-disease pairs, respectively. In comparison, organic chemicals comprise only 8% of all factor-disease relations. However, organic chemicals make up a large share of all factors (15%), suggesting an opportunity to explore these factors. Another opportunity lies in pinning down the composite factors lying in the most abstract categories, such as “chemical”; for example, air pollution is a complex matrix of specific factors belonging to other categories, each of which might have a distinct contribution to disease. The envirome is a complex entity due to the heterogeneity of its factors.

We acknowledge overlap and imbalance of this categorization; however, we stress that this first pass is exploratory and not definitive. We propose future work, the “envirome sequencing project”, focused on how exactly the envirome is defined and categorized, considering composition of physical factors (e.g., chemical structure), scope of biological effect (e.g., toxicological responses), potential routes of exposure, and modes of direct measurement in human tissue (e.g., cell assays, mass spectrometry, self-report). The next phase of such a project would follow in the footsteps of the HapMap project [25], characterizing how common environmental factors vary in different populations.

Our search resulted in a total of 89,653 unique disease-factor pairs assembled from 4977 unique environmental factors and 3189 unique disease conditions. Figure 2 shows the number of publications for each factor versus the number of diseases investigated in the context of that factor. Specifically, the median number of publications for a factor was 8; the most highly investigated factor was “Smoking” (2416 publications), and 17% of all factors were investigated only once (1 publication). The median number of diseases investigated per factor was 6. The most highly investigated diseases were “Dermatitis” (3353 publications), “Drug Eruptions” (3151 publications), and “Occupational Disease” (3105 publications); 11% of diseases have been investigated only once (1 publication). In general, the more a factor is investigated, the more diseases it is investigated with (Figure 2). The most highly studied factor-disease pairs are seen in Table 1 and include factors such as Asbestos, UV Radiation, Smoking, and Air Pollution studied with diseases such as Lung Cancer, Skin Cancer, and Mesothelioma.

[Figure 2 plot: number of diseases for a factor (y-axis) versus number of publications for a factor (x-axis)]

Figure 2. Environmental factors investigated in context of disease sourced from MEDLINE. Each point in the figure represents an environmental factor, where the x-axis represents the number of publications for the factor and the y-axis represents the number of diseases investigated for that factor. For example, the factor “Smoking” is the top right-most point in the plot, with 2,416 publications investigating it among 560 diseases. The red line depicts the median and the grey lines depict deciles. Markers are faint and jittered to depict concentration of data.
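The per-factor summaries reported above (median publications per factor, median diseases per factor, and the share of factors investigated only once) could be derived from pair counts along the following lines. This is a sketch under stated assumptions: the input is a hypothetical mapping from (factor, disease) pairs to publication counts, and publications for a factor are approximated by summing over its pairs, which would count an article twice if it covers two diseases.

```python
from collections import defaultdict
from statistics import median


def summarize_factors(pair_counts):
    """Summarize how intensively each factor has been studied."""
    if not pair_counts:
        raise ValueError("pair_counts must not be empty")
    publications = defaultdict(int)  # factor -> total publications (approximate)
    diseases = defaultdict(set)      # factor -> distinct diseases studied with it
    for (factor, disease), n_publications in pair_counts.items():
        publications[factor] += n_publications
        diseases[factor].add(disease)
    single = sum(1 for n in publications.values() if n == 1)
    return {
        "median_publications": median(publications.values()),
        "median_diseases": median(len(d) for d in diseases.values()),
        "share_single_publication": single / len(publications),
    }


# Hypothetical counts, not values from the MEDLINE search described here.
toy_pair_counts = {
    ("Factor A", "Disease 1"): 3,
    ("Factor A", "Disease 2"): 1,
    ("Factor B", "Disease 1"): 1,
    ("Factor C", "Disease 3"): 2,
}
toy_summary = summarize_factors(toy_pair_counts)
```

Plotting the per-factor publication totals against the per-factor disease counts on logarithmic axes would give a scatter of the kind shown in Figure 2.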


Category                        Total number of factors (%)   Number of factor-disease relations (%)   Top factor and disease relationship
Animal (General)                7 (< 1%)                      15 (< 1%)                                Birds (Bird Disease)
Bacteria                        189 (4%)                      857 (1%)                                 E. coli (Diarrhea)
Behavior                        28 (<1%)                      1101 (1%)                                Smoking (Lung neoplasms)
Chemical (General)              220 (4%)                      4392 (5%)                                Air pollution (Asthma)
Dietary component               290 (6%)                      4912 (5%)                                Lipids (Cardiovascular disease)
Drug                            1322 (27%)                    21071 (24%)                              Analgesics, Opioid (Pain)
Element                         134 (2%)                      3229 (4%)                                Lead (Lead Poisoning)
Eukaryotes                      99 (2%)                       218 (< 1%)                               Mites (Mite Infestations)
Fungus                          35 (<1%)                      168 (< 1%)                               Candida (Candidiasis)
Hormone                         263 (5%)                      5283 (5%)                                Estrogens (Breast neoplasms)
Immune factors                  38 (5%)                       5826 (6%)                                Measles-Mumps-Rubella Vaccine (Autistic Disorder)
Poisoning / injury process      38 (<1%)                      930 (1%)                                 Occupational Exposure (Dermatitis)
Inorganic chemical              145 (2%)                      2554 (3%)                                Asbestos (Lung neoplasms)
Man-made object                 58 (1%)                       1153 (1%)                                Laser (Eye injuries)
Nucleic Acids                   28 (<1%)                      207 (<1%)                                RNA, viral (Hepatitis C)
Occupation                      23 (<1%)                      177 (<1%)                                Equipment design (Burns)
Organic chemical                737 (15%)                     7318 (8%)                                Latex (Dermatitis)
Procedure                       801 (16%)                     26388 (29%)                              Radiotherapy (Neoplasms)
Natural Process                 73 (1%)                       2244 (3%)                                UV Rays (Neoplasms)
[category missing]              83 (2%)                       952 (1%)                                 Streptokinase (Thrombosis)
Virus                           141 (3%)                      658 (<1%)                                HIV (AIDS)

Table 1. Tentative categories of environmental factors as collected from MEDLINE MeSH terms. Second column shows the number of factors within that category and the percentage of all factors. Third column shows the number of disease relationships for a category. Right-most column shows an example factor belonging to the category and a disease studied with that factor. Left-most column is the color key for Figure 3.


Just as the “Human disease network” has enabled the definition of the “diseaseome”, a comprehensive representation of diseases and their interrelationships with genomic factors [26], we introduce the “Envirome-disease network” to aid in the definition of the envirome and its interplay with common diseases (Figure 3, Figure 4). The “Envirome-disease network” consists of weighted links corresponding to the number of publications between specific factors and WHO-prioritized diseases [27], including cardiovascular disease, coronary disease, hypertension, type 2 diabetes (T2D), kidney disease, premature births, lung cancer, prostate cancer, colorectal cancer, Alzheimer’s disease, asthma, and chronic obstructive pulmonary disease (COPD). In other words, the more publications that link a disease and a factor, the stronger the link. As seen in the network (Figure 4), the cardiovascular-related and metabolic diseases (cardiovascular disease, hypertension, T2D, kidney disease) appear to share many diet- and therapy-related factors, such as carbohydrates and anti-hypertensive drugs. Further, lung-related diseases such as asthma, COPD, and lung cancer share factors such as smoking, tobacco smoke, and air pollution. Conditions such as premature births, colorectal cancer, and COPD are less studied relative to cardiovascular-related diseases or breast and lung cancer.
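The weighted network described above can be represented with a simple mapping from each disease to its linked factors. The sketch below is illustrative only: the function names are hypothetical, the publication counts are invented for the example, and no graph library is assumed.

```python
from collections import defaultdict


def build_envirome_disease_network(pair_counts):
    """Map each disease to its linked factors, weighted by publication count."""
    network = defaultdict(dict)
    for (factor, disease), n_publications in pair_counts.items():
        network[disease][factor] = n_publications  # link weight
    return dict(network)


def shared_factors(network, disease_a, disease_b):
    """Return the factors linked to both diseases, in alphabetical order."""
    factors_a = set(network.get(disease_a, {}))
    factors_b = set(network.get(disease_b, {}))
    return sorted(factors_a & factors_b)


# Invented link weights, for illustration only.
example_pairs = {
    ("Smoking", "Asthma"): 12,
    ("Air Pollution", "Asthma"): 20,
    ("Smoking", "Lung Cancer"): 30,
    ("Asbestos", "Lung Cancer"): 25,
    ("Dietary Fats", "Hypertension"): 8,
}
example_network = build_envirome_disease_network(example_pairs)
lung_related_overlap = shared_factors(example_network, "Asthma", "Lung Cancer")
```

Querying shared factors in this way mirrors the observation that lung-related diseases are linked through common factors such as smoking; a dictionary of dictionaries is used here only because it keeps the link weights easy to inspect.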

Figure 3. Envirome-disease network for WHO priority diseases [27]. A link between an environmental factor and disease node denotes their association in MEDLINE. Each link is scaled in size according to the number of citations for that association; for example, the largest link is observed between “Smoking” and “Lung Neoplasms”. Environmental factor and disease nodes are scaled in proportion to the number of connections they have with other nodes (their “degree”); similarly, labels appear for nodes with degree ≥ 4. Factors studied for only a single disease are observed on the outer portion of the figure with single links, while factors linked with many diseases are toward the center. Specific factors are colored according to their category (Table 1); disease nodes are not colored. The top 5 cited factors for each disease are annotated offset with the number of citations in parentheses.


Figure 4. “Zoomed-in” Envirome-disease network for WHO priority diseases. For clarity, only nodes of degree ≥ 4 are shown here. See Figure 3 for the caption and full network.


From such a comprehensive representation, we may begin to assemble the envirome from a set of factors investigators have prioritized through study of their relationship with disease. However, for discovery and hypothesis generation, arguably the most important associations are ones that have few or no citations, for example factors found in the lower left quadrant of Figure 2, or corresponding factors with a single or no link in the envirome-disease network (Figure 3). In the following section, we present how we may systematically create these hypotheses to establish links for further study.

Creation of robust hypotheses connecting the environment, genome, and multifactorial disease

Genomics and informatics have enabled the creation of novel and validated hypotheses on a multidimensional breadth for multifactorial diseases. This scale is required for complex diseases where many such factors are thought to take part. Through GWAS we have discovered genetic variants that have enabled scientists to postulate about the function of the pathways discovered in diseased individuals and have led to biological and clinical experimentation (for example, [28-31]). While GWAS has fallen short of explaining the total heritable risk of complex disease [32], the data-driven methodology has enabled us to collect robust, multidimensional evidence, opening new avenues of investigation [32, 33].

In this dissertation, we have developed and implemented analytic methods to associate environmental factors to multifactorial disease (Figure 5). Specifically, we propose methods that scale across experimental frameworks, from populations (Figure 5 A, C) down to molecules (Figure 5 D, F). Further, they scale in resource utilization through use of publicly accessible data sets.


Figure 5. Overview of population- and molecular-scale methods to create hypotheses across the envirome and genome. Examples are fictional and shown for illustrative purposes. A.) An Environment-wide Association Study (EWAS) is a population-scale method to screen multiple factors in the envirome for association with a disease of interest. Depicted is a “Manhattan plot”, a way to visualize strength of association for each factor over the envirome. For example, the pesticide Heptachlor is depicted as the highest ranking finding in an EWAS in association to Type 2 Diabetes. B.) Genome-wide Association Study (GWAS), in which genomic variants are associated to disease on a genome-wide dimension. C.) Genome-Environment-Wide Association Study (G-EWAS), in which marginal findings from GWAS and EWAS are integrated and screened jointly for interaction in the context of a disease of interest. For example, evidence for Heptachlor and SLC30A8 interaction against Type 2 Diabetes is shown in the 2D plot. D.) Representation of an “Envirome Map”, whereby gene expression “signatures” induced by physical environmental factors on model systems are summarized in a matrix. For example, Bisphenol A has a signature consisting of CYP1A1, MAPK1, and ESR1 expression. E.) Standard gene expression disease responses collected on multiple diseases, for example Type 2 Diabetes, Breast Cancer, and Coronary Heart Disease. F.) Method to correlate environmental factor expression signatures to disease state, for example Bisphenol A to Breast Cancer. Panels A-C consider studies on a population scale, panels D-F on a molecular or toxicological scale. The envirome domain is depicted in green, the genome domain in red. Examples are shown in italics.

Creating hypotheses comprehensively on a population scale

In EWAS (Chapters 3 and 4, Figure 5A), as in GWAS (Figure 5B), multiple environmental factors, or the “envirome”, are interrogated against disease state using an epidemiological study design. These studies can be “case-control”, whereby factor variability is compared between incident disease cases (individuals recently diagnosed) and non-diseased individuals. EWAS can also be utilized in other observational study designs, most notably a “cohort” or “cross-sectional” one, in which individuals are sampled from a pre-defined population and their disease status is determined after the sampling process. Both the selection and the determination of cases and controls are well studied in epidemiology, and non-optimal selection can lead to biases in estimates [34]. For example, if disease cases are misclassified as controls, estimates may be attenuated.

Other types of biases abound in observational studies and must be acknowledged and considered. For example, in a cross-sectional study one cannot easily resolve the temporal relationship between exposure to a factor and disease (i.e., did the disease come first, or the exposure?). This bias is known as “reverse causality” [34]. Other biases, such as confounding, also hinder inference in observational studies. A confounding variable is one that correlates with both the disease state and the factor; thus, the association of the factor with the disease can be thought of as a stand-in for the association between the confounding variable and disease. Confounding variables need not be measured. While we introduce means to estimate confounding bias in this dissertation utilizing measured variables (Chapter 3), confounding bias cannot be avoided altogether.

We have developed and applied EWAS [35, 36] using a cross-sectional dataset known as the National Health and Nutrition Examination Survey (NHANES), a representative survey of the non-institutionalized United States population carried out by the Centers for Disease Control and Prevention and the National Center for Health Statistics [37]. Participants of NHANES are surveyed regarding their health and disease through a battery of questionnaires, a physician-led physical exam, and urine- and blood-based laboratory tests. A series of lab tests (N=150-266, depending on survey year) consists of blood or urine markers of environmental factors, such as heavy metals, persistent organic pollutants, nutrients and vitamins, antibodies against allergens, and indicators of pathogenic exposure. In addition, several questionnaire items are used as proxies of environmental factors (N=300-1000, depending on survey year), such as years smoked and pharmaceutical and drug use. Finally, many clinical biomarkers used for disease diagnosis and risk prediction, such as body mass index, fasting glucose, and serum lipid levels, are jointly measured, providing a platform to create hypotheses regarding prevalent diseases without investment in recruitment of subjects.

In its current form, each factor in an EWAS (that is, each member of the “envirome”) is comprehensively associated to disease or phenotypic state, often depicted in a “Manhattan plot” (Figure 5A), a transparent representation of all findings. Like GWAS, the framework calls for a massive number of comparisons, which can lead to false positives. EWAS utilizes the “false discovery rate” (FDR) to account for multiple comparisons; the FDR provides a quantitative estimate of the proportion of false discoveries at a given level of statistical significance and is less conservative than family-wise error rate control such as the Bonferroni correction [38-40]. Along with a more stringent threshold to account for false positives, positive findings are evaluated in independent cohorts and surveys. Multiple comparisons are not considered in most environmental epidemiological studies; this level of stringency allows for robust and quantitative prioritization of findings [10, 41, 42].
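To make the contrast between FDR control and a family-wise correction concrete, the sketch below applies the Benjamini-Hochberg step-up procedure and the Bonferroni correction to the same set of p-values. Benjamini-Hochberg is used here only as a common, well-known FDR procedure; it is an assumption and may differ from the FDR estimator applied in the following chapters. The p-values and function names are hypothetical.

```python
def benjamini_hochberg(p_values, alpha=0.05):
    """Flag hypotheses rejected by the Benjamini-Hochberg step-up procedure."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    # Find the largest rank k such that p_(k) <= (k / m) * alpha.
    cutoff_rank = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= rank / m * alpha:
            cutoff_rank = rank
    rejected = [False] * m
    # Reject every hypothesis ranked at or below the cutoff.
    for rank, idx in enumerate(order, start=1):
        if rank <= cutoff_rank:
            rejected[idx] = True
    return rejected

def bonferroni(p_values, alpha=0.05):
    """Flag hypotheses rejected at the Bonferroni-corrected threshold alpha / m."""
    m = len(p_values)
    return [p <= alpha / m for p in p_values]

# Hypothetical p-values for ten environmental factors.
p = [0.001, 0.008, 0.039, 0.041, 0.042, 0.06, 0.074, 0.205, 0.212, 0.216]
n_fdr = sum(benjamini_hochberg(p))
n_bonferroni = sum(bonferroni(p))
```

With these inputs the FDR procedure retains more findings than the Bonferroni threshold, which illustrates why FDR-based control is preferred when screening a large number of factors.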

We have applied EWAS to multiple disease-related phenotypes, notably T2D [35] and serum lipid levels [36], risk factors for coronary heart disease (Chapter 4). As with GWAS, we were able to identify and validate in independent surveys both novel and known factors associated with the phenotypes that should be followed up in additional epidemiological or toxicological studies. For example, we have created hypotheses about pollutant factors such as polychlorinated biphenyls and organochlorine pesticides, both associated with significantly increased T2D prevalence and adverse lipid profiles. Surprisingly, a vitamin marker, γ-tocopherol, was also observed to have an adverse relationship with the diseases, leading us to hypothesize about both reverse causality and possible harmful effects of the vitamin [43].

Common disease is hypothesized to arise from a combination of both genetic and environmental factors, but GWAS and EWAS each examine one of these domains without consideration of the other; that is, they test marginal associations. We propose another population-based study, called a Genome-Environment-Wide Association Study (or G-EWAS, Figure 5C, Chapter 5), to examine the joint effect of these factors. Specifically, we interrogate individual findings from GWAS and EWAS, testing whether each pair-wise combination of a genetic and environmental factor “interacts”, a phenomenon known as “gene-environment interaction” [44]. When testing for interaction, we examine whether the joint effects are greater or smaller than expected from the marginal effects alone.
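One way to read “joint effects greater or smaller than marginal effects alone” is on the multiplicative odds-ratio scale, sketched below for a binary variant and a binary exposure. The choice of scale, the function names, and the counts are assumptions made for illustration; the analysis in Chapter 5 may instead use a regression model with a product term.

```python
def odds(cases, controls):
    """Odds of disease within one stratum."""
    return cases / controls

def interaction_ratio(counts):
    """Ratio of the joint odds ratio to the product of the two single-factor odds ratios.

    counts maps (g, e) -> (cases, controls), where g and e are 0/1 indicators
    for carrying the genetic variant and being exposed to the factor.
    A ratio of 1 means no interaction on the multiplicative scale.
    """
    reference = odds(*counts[(0, 0)])
    or_gene = odds(*counts[(1, 0)]) / reference
    or_environment = odds(*counts[(0, 1)]) / reference
    or_joint = odds(*counts[(1, 1)]) / reference
    return or_joint / (or_gene * or_environment)

# Hypothetical case and control counts in each of the four strata.
counts = {
    (0, 0): (100, 1000),
    (1, 0): (60, 500),
    (0, 1): (75, 500),
    (1, 1): (90, 250),
}
ratio = interaction_ratio(counts)
```

In this toy table the joint odds ratio is twice the product of the single-factor odds ratios, so the ratio exceeds 1 and suggests a greater-than-multiplicative joint effect.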

Interaction effects enable investigators to postulate about biological mechanisms underlying the disease of interest. As an example, investigators have recently confirmed increased risk of bladder cancer for individuals who carry variants in the NAT2 gene and who smoke [45]. NAT2 encodes an enzyme that metabolizes chemical compounds; the presence of a statistical interaction has prompted a hypothesis connecting altered NAT2 function, chemical compounds found in cigarette smoke and their metabolites, and the pathology of bladder cancer [46].

However, there are a few problems with current gene-environment investigations [44, 47] (Chapter 5). First, like most environmental epidemiological investigations, gene-environment interaction studies rely on a priori selection of factors to test. The task of choosing factors is particularly daunting given that the number of common genetic and environmental variants is on the order of thousands to millions. Second, gene-environment studies are resource intensive, requiring exponentially greater sample sizes than studies that examine either component alone. Relatedly, these studies can be analytically intensive, incurring a large multiple hypotheses cost due to the many combinations of factors to test. Therefore, in the spirit of EWAS and GWAS, we propose G-EWAS. This approach attempts to solve the problem of choosing which factors to test while alleviating some of the analytical burden of testing a large hypothesis space.

Like EWAS and GWAS, G-EWAS is a data-driven method to find interactions between robust and replicated variants marginally associated with disease in EWAS and GWAS (Figure 5B). This method avoids the variable selection bias that has plagued candidate gene studies [48] while keeping the hypothesis space constrained. Specifically, each possible pair of factors found in EWAS and GWAS is tested for an interaction associated with the disease of interest, screening a 2-dimensional hypothesis space on the order of hundreds, not thousands (Figure 5C). Gene-environment studies are power-intensive, so keeping the hypothesis space as small as possible is desirable [47]. Relatedly, increasing the number of hypotheses increases the burden of false positives; family-wise methods for multiple hypothesis control (e.g., the Bonferroni correction) become too conservative, and methods to estimate the FDR are necessary.

We demonstrated the utility of G-EWAS in application to T2D (Chapter 5) and found an interaction between a non-synonymous variant in the SLC30A8 gene and two vitamin markers, γ-tocopherol and trans-β-carotene, after adjustment for risk factors and consideration of multiple comparisons. Of note, we observed up to 30-40% increased genetic risk when considering specific environmental factors. Of course, proof of statistical interaction does not imply an etiological relationship between the factors. However, investigators have observed that diabetes is only induced in SLC30A8 knockout models in the presence of a high-fat diet [49, 50]. Results here strengthen this hypothesis and suggest specific factors whose presence or absence in the diet may induce a diabetic state. With G-EWAS, we have a platform to produce multiple data-driven hypotheses regarding biological mechanisms through gene-environment interactions. Further, these interaction findings have implications for personalized genetic risk [23].

Creating hypotheses comprehensively on a molecular or toxicological scale

Toxicology is concerned with the physical substances and exposures that lead to adverse changes on the organism and/or molecular level, and with how organisms are exposed to substances. Specifically, toxicologists utilize the physical sciences to measure how physical substances interact with biological systems to induce physiological change. First, how do biochemical processes alter a substance for digestion, absorption, and excretion; that is, what are the “toxicokinetic” properties of a system given exposure? Second, how does the substance induce “toxicodynamic” change; for example, how does the substance influence specific targets? Last, how do toxicokinetics and toxicodynamics influence functional change, such as in cell viability and metabolism [3]? A cornerstone of toxicology is “dose-response” modeling, in which a molecular response is correlated with controlled doses of a substance. Ascertaining a dose-response relationship enables inference regarding the type of relationship between a substance and a biological system connected to the response (e.g., adverse or protective effect).

Another way to ascertain molecular response is to utilize genome-wide measurements, such as commoditized gene expression microarrays. This subfield of toxicology is known as “toxicogenomics” and aims to “study the response of a whole genome to toxicants” [51]. In contrast to the population-based EWAS approach, toxicogenomics shows how specific environmental factors may perturb a biological system; however, these responses are unconnected to complex diseases.

In Chapter 2, we show how one may use the tools of integrative genomics to connect toxicogenomic responses with disease-associated responses, thus enabling hypothesis generation between specific physical environmental factors and complex diseases, such as cancer (Figure 5D-F). For example, landmark genomic research has linked chemically induced functional changes to disease and related phenotypes in the context of therapeutic prediction. Lamb et al., in an effort dubbed the “Connectivity Map”, correlated 164 drug-induced gene expression changes on cell lines to human disease-associated gene expression states, predicting novel molecules for therapeutics [52]. Analogous to this work, we ask which potentially environmentally induced changes are correlated with, and might explain variation in, functional disease states.

The proposed method takes full advantage of the publicly available toxicological and disease-related data such as the Gene Expression Omnibus (GEO) [53], the Comparative Toxicogenomics Database (CTD) [54], the Toxin and Toxin-Target Database (T3DB) [55], and the National Toxicology Program’s ToxCast effort [12, 56], thus providing a scalable way to derive hypotheses with minimal effort in upfront experimental design.

To begin, numerous experiments examining disease-associated gene expression are accessible in GEO. Furthermore, it is possible to compare these disparate datasets to make inferences over the aggregate. For example, Dudley et al. have collected 238 disease-associated expression responses from GEO for cross-disease analysis [57]. Further, the authors have shown that the signal associated with disease is stronger than experimental or tissue-of-origin artifacts [57]. Most importantly, the authors have used this representation to predict novel therapeutics connected with disease in an unbiased manner [58]. We claim the same can be achieved with environmental factors.

Toward this goal, we represent a compendium of disease-associated gene expression datasets in matrix form, where columns represent different diseases, rows represent genes, and each entry is a measure of differential gene expression (Figure 5F) corresponding to the gene-disease pair. This representation allows one to systematically infer over these disparate datasets. Of course, when aggregating data over multiple experiments from different investigators, care must be taken to ensure the inter-comparability of the data [59, 60]. Nevertheless, the commoditization of measurement platforms has enabled data standardization, easing some of the burden of ensuring their comparability [57].
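The matrix representation can be sketched with plain Python lists, as below. The gene and disease names follow the fictional examples of Figure 5; the numeric scores, the threshold, and the helper `disease_signature` are hypothetical and serve only to show how a column of the matrix yields a set of differentially expressed genes for one disease.

```python
genes = ["CYP1A1", "MAPK1", "ESR1", "SLC30A8"]
diseases = ["Type 2 Diabetes", "Breast Cancer", "Coronary Heart Disease"]

# Rows are genes, columns are diseases. Entries are hypothetical differential
# expression scores (for example, log fold change of disease versus control).
matrix = [
    [0.1, 1.8, -0.2],   # CYP1A1
    [0.4, 1.1, 0.3],    # MAPK1
    [-0.1, 2.3, 0.0],   # ESR1
    [-1.5, 0.2, 0.1],   # SLC30A8
]

def disease_signature(disease, threshold=1.0):
    """Return genes whose absolute score in the disease's column exceeds the threshold."""
    column = diseases.index(disease)
    return {gene for gene, row in zip(genes, matrix) if abs(row[column]) > threshold}

breast_cancer_genes = disease_signature("Breast Cancer")
```

Storing every disease in one matrix with a shared gene index is what makes cross-dataset queries simple: each disease signature is a column lookup rather than a separate file format.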

Prototypical gene expression experiments in toxicogenomics can be framed similarly (single columns of Figure 5D). These experiments typically involve characterizing the effect of a range of dosages of a handful of environmental chemicals on gene expression in a model organism, such as mouse or rat, or in a cell line system. These experimental data files are then submitted to GEO or are summarized in databases such as the CTD. Just as the Connectivity Map has compiled “signatures”, or patterns of expression, for individual chemicals for prediction of therapeutics for disease states, one can do the same using numerous toxicogenomic experiments covering multiple environmental chemicals; we call this effort an “Envirome Map” (Figure 5D).

We then query the Envirome Map (Figure 5D) with specific disease-associated expression datasets (Figure 5E), correlating environmental signatures with disease-associated expression states (Figure 5F). In Chapter 3, we develop a method to compute correlations between these datasets [61]. Specifically, each environmental chemical expression signature is queried for enrichment of genes expressed in the disease expression set. If a disease expression dataset shares more genes with a chemical signature than expected by chance alone, one concludes that the chemical signature and disease expression states are highly correlated. This process is repeated for every chemical in the Envirome Map. Again, multiple hypotheses are accounted for by estimating the FDR, and the top-ranked correlations are the top hypotheses generated from the procedure.
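A common way to ask whether an overlap is “greater than chance alone” is the upper tail of the hypergeometric distribution, sketched below. The use of this particular test is an assumption made for illustration and is not confirmed by the text; the function name and the gene counts are likewise hypothetical.

```python
from math import comb

def enrichment_p_value(n_universe, n_signature, n_disease, n_overlap):
    """Probability of observing at least n_overlap shared genes by chance.

    n_universe:  number of genes measured in total
    n_signature: number of genes in the chemical signature
    n_disease:   number of genes differentially expressed in the disease
    n_overlap:   number of genes shared by the two sets
    """
    total = comb(n_universe, n_disease)
    largest_possible = min(n_signature, n_disease)
    # Sum the hypergeometric probabilities for overlaps of n_overlap or more.
    tail = sum(
        comb(n_signature, k) * comb(n_universe - n_signature, n_disease - k)
        for k in range(n_overlap, largest_possible + 1)
    )
    return tail / total

# Hypothetical query: 8 of 40 disease genes fall in a 50-gene chemical signature.
p_value = enrichment_p_value(n_universe=1000, n_signature=50, n_disease=40, n_overlap=8)
```

Repeating this calculation for every chemical signature yields one p-value per chemical, and those p-values are then the input to the FDR estimate.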

We utilized the Envirome Map to create hypotheses regarding breast, lung, and prostate cancers (Chapter 3, Figure 5F). Specifically, gene signatures of established factors such as estrogens were highly correlated with breast cancer. We also observed an endocrine disruptor, bisphenol A (BPA), to be associated with expression states of the disease. However, we lack the directionality of such associations, and further experimentation is needed to characterize the toxicokinetics and toxicodynamics of BPA in relation to breast cancer. We discuss validation of these hypotheses in the next section.

Discussion

We have introduced a representation of the collection of all dynamic and specific environmental factors called the envirome. In our brief survey of how this entity is studied, we observed its breadth and heterogeneity (Figure 1, Figure 2, Figure 3). However, despite its breadth, the envirome is not studied as rigorously as the genome. To this end, we propose population- and molecular-scale methods to enable scalable hypothesis generation between the envirome and disease.

The next question is what happens with these hypotheses. First, we discuss and introduce methods for validation to infer population risk; second, we discuss new study designs to investigate the molecular response to predicted factors.

On a population scale, validation of associations to affirm risk ideally occurs in diverse populations and in study designs that minimize confounding and reverse causal bias. For example, the randomized trial is the “gold standard” for validation of therapeutics. As randomized trials are not possible for factors with adverse associations, prospective studies may be executed instead, utilizing cohorts followed through time or even across multiple familial generations, such as the Framingham Study [62, 63]. However, while we may then understand the temporal pattern of the exposure-disease relationship, biases still cannot be excluded.

Taking a step back, standardization of methodology and measurements has enabled validation of results and verification of risk estimates derived from genome-wide studies. For example, it is now typical for large consortia to validate genetic results; recent GWAS results have been strengthened by multiple meta-analyses that collect data on up to 100,000 individuals around the world [64, 65]. We argue that data standards for the envirome would enable similar validations, aiding the design of longitudinal studies that can be combined for meta-analyses. On this point, the PhenX project is centered on building consensus on the type and measurement of high-priority environmental factors and phenotypes [66, 67]; however, these standards have yet to be adopted in high-profile epidemiological studies.

Introduction of standards would enable methods such as “Mendelian randomization” to be comprehensively evaluated as a tool for validation [68, 69]. Mendelian randomization provides a way to approximate a randomized trial through use of genetic loci that vary with exposure independent of phenotypic state. The association between disease and an exposure is therefore mimicked by the association between the genetic variant and disease. Given that genetic variants assort randomly, we avoid the biases in traditional association analyses by using variants as stand-ins for exposures. Following from this, a set of these variants can possibly be utilized to validate factors found in EWAS. The central challenge would be in determining which variants vary with the massive number of possible factors that could be found in an EWAS. Further, the variation must be described in populations of interest. Nevertheless, GWAS have explored genetic variation in relation to environmental factors, such as smoking dependence and consumption [70, 71], alcohol consumption and dependence [72-74], infection susceptibility [75-77], coffee consumption [78], exercise [79], and vitamin B levels [80], in addition to existing “pharmacogenetics” studies, which associate genetic variation with drug response [81]. For example, suppose one hypothesizes a relationship between coffee consumption and diabetes. Suppose also that a (hypothetical) genetic variant called COFFEE has been identified as a strong marker correlated with the amount of coffee an individual drinks per day. Evidence that the COFFEE marker is associated with diabetes would then support our original hypothesis; if the marker is not associated, we might conclude that the original association is biased. This is just a simple example; there is a need for investigation and novel methods that utilize sets of such proxy genetic variants as a means to validate environment-disease associations.
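The COFFEE example can be made quantitative with the Wald ratio, one simple and widely used Mendelian randomization estimator, which divides the variant-disease association by the variant-exposure association. The text does not name an estimator, so this choice is an assumption, and the effect sizes below are invented for illustration only.

```python
def wald_ratio(beta_variant_outcome, beta_variant_exposure):
    """Estimate the exposure's effect on the outcome from one proxy variant.

    beta_variant_outcome:  association of the variant with the outcome
                           (for example, change in log-odds of disease per allele)
    beta_variant_exposure: association of the variant with the exposure
                           (for example, change in cups of coffee per day per allele)
    """
    if beta_variant_exposure == 0:
        raise ValueError("the variant must be associated with the exposure")
    return beta_variant_outcome / beta_variant_exposure

# Hypothetical values: each COFFEE allele adds 0.5 cups per day and changes
# the log-odds of diabetes by -0.04.
log_odds_per_cup = wald_ratio(-0.04, 0.5)
```

If the variant showed no association with diabetes, the numerator and hence the estimate would be zero, matching the verbal argument that the original coffee-diabetes association might then be attributed to bias.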

On the other hand, we must test how chronic, low doses of specific environmental factors modulate molecular responses in model, but clinically relevant, systems in order to learn about disease biology. On a molecular scale, it has long been possible to attain a wealth of both phenotypic and genotypic data from model systems [82], and these methods should complement toxicology to elucidate the disease biology of hypothesized factors [83]. Ideally, however, we should attempt to study molecular response in actual populations, eventually investigating both disease biology and population risk simultaneously.

In fact, initial investigations have blurred the line between traditional molecular- and population-based approaches in studying external factors in the context of complex phenotypes. For example, Idaghdour and colleagues have shown the relationship between genetic variation and leukocyte gene expression in the context of urban and/or rural habitation among 194 individuals in Morocco [84]; however, geography is an abstraction of specific environmental factors. Future studies of similar design should assess how hypothesized factors correlate with changes in genome-wide molecular measures on a population scale among diseased individuals.

In conclusion and in the following chapters, we describe and apply methods to comprehensively associate multiple specific environmental factors, a subset of the “envirome”, to complex disease for hypothesis generation. Specifically, we introduce methods to link molecular responses to disease states through a representation we call the “Envirome Map”. Second, we describe population-level methods to find novel and robust associations between the envirome and disease called EWAS. Third, we apply EWAS in the context of T2D and serum lipid levels. Last, we show how we can integrate EWAS findings with GWAS to predict how environmental factors modify genetic risk. Future work in deciphering environmental contributions to disease will benefit from specific definition of the envirome, standardization of its measurement, and comprehensive integration of molecular-scale measures on at-risk populations.

CHAPTER 2. MAPPING MULTIPLE TOXICOLOGICAL RESPONSES TO COMPLEX DISEASE

“All substances are poisons; there is none which is not a poison.” -- Paracelsus (1493–1541)

INTRODUCTION

Certainly Paracelsus, a Renaissance physician credited with founding the study of “poisons” and toxicology, would remark similarly if living today. As a result of modernization and commercialization, the breadth of “poisons”, which we will call in the abstract environmentally sourced physical exposures, has become immense in type and property [85]. As introduced in Chapter 1, toxicology is the study of the effect of physical agents on biological systems, often with regard to adverse effects [3].

Specifically, toxicologists try to understand the fundamental mechanisms of these effects, be they biochemical, cellular, genetic, or molecular. The history of the field goes back to Paracelsus’ time, and there is a rich literature on methodology for elucidating these effects. One area of practice in toxicology is known as “risk assessment”, or prediction of how toxicological effect results in changes in health [3]. However, “risk assessment” typically addresses potential, immediate hazards and non-chronic effects, inferred from model systems and organisms exposed to high doses [11, 13]. Toxicological risk assessment says very little about chronic or complex disease [86], which is the subject of this dissertation.

Nonetheless, our knowledge regarding the ways chemical exposures induce low-level biological response is increasing with the advent of high-throughput measurement and screening modalities [12, 54, 87, 88]. However, toxicological response remains unconnected to complex disease and public health, and it is currently difficult to ascertain multiple associations of chemicals to health status without significant experimental investment or large-scale epidemiological study. Use of publicly available environmental chemical factor and genomic response data, such as toxicogenomic gene expression data, may facilitate the discovery of these associations.

What is “toxicogenomics”? Toxicogenomics ramps up signatures of “biological response” to the dimension of the entire genome. That is, toxicogenomics refers to the patterns of changes in response due to exposure to physical agents, measured via modalities in functional genomics, such as proteomic mass spectrometry and gene expression microarrays, and analyzed using bioinformatics techniques [51]. These modalities have become so entrenched in functional genomics that there already exist rich, publicly available data sources and methods to explore toxicogenomic-level response (Chapter 1).

In the following, we propose to use pre-existing datasets and knowledge bases in order to derive hypotheses regarding chemical toxicological association to disease without upfront experimental design, extending the work of toxicogenomics. Specifically, we have asked what environmental chemicals could be associated with gene expression data of disease states such as cancer, and what analytic methods and data are required to query for such correlations. This study describes a method for answering these questions. We integrated publicly available data from gene expression studies of cancer and toxicology experiments to examine disease/environment associations. Central to our investigation were the Comparative Toxicogenomics Database (CTD) [54], which contains information about chemical/gene/protein responses and chemical/gene/disease relationships, and the Gene Expression Omnibus (GEO) [53], the largest public gene expression data repository. Information in the CTD is curated from the peer-reviewed literature, while gene expression data in GEO is uploaded by submitters of manuscripts. We use the CTD to create an “Envirome Map”, which is ultimately used to create hypotheses about the molecular links between environmental factors and disease states (Figure 5D-F).

Most approaches to date to associate environmental chemicals with genome-wide response fall into two categories. These approaches have either 1.) tested a small number of chemicals on cells and measured responses on a genomic scale, or 2.) used existing knowledge bases, such as Gene Ontology, to associate annotated pathways to environmental insult.

The first method involves measuring physiological response on a gene expression microarray. This approach allows researchers to test chemical association on a genomic scale, but the breadth of discoveries is constrained by the number of chemicals tested against a cell line or model organism. These experiments are not intended for hypothesis generation across hundreds of potential chemical factors with multiple phenotypic states. Only a few chemicals can be tractably tested for association to gene activity [89, 90], or disease on cell lines [91], or on model organisms, including rat and mouse [92]. In rare cases, this approach has reached the level of hundreds or thousands of chemical compounds, as in the Connectivity Map, developed by Lamb, Golub, and colleagues [52], which attempts to associate drugs with gene expression changes. After measuring the genome-wide effect on gene expression following application of hundreds of drugs at various doses, drug signatures are calculated and then queried against other datasets for which a potential therapeutic is desired. While this has proven to be an excellent system to find chemicals that essentially reverse the genome-wide effects seen in disease, the approach of measuring gene expression and calculating signatures across tens of thousands of environmental chemicals is not always feasible or scalable. Although other data-driven approaches have been described [93], few have given insight into external causes of disease.

A second approach has been to use knowledge bases, such as Gene Ontology [94], to aid in the interpretation of genomic results. For example, Gene Ontology analysis of a cancer experiment might elucidate a molecular mechanism related to an environmental chemical. Unfortunately, there is still a lack of methodology to derive hypotheses for environmental-genetic associations in disease pathogenesis, as Gene Ontology and general gene-set based approaches have limited information on environmental chemicals.

In contrast to the previous approaches, we claim that the integration of pre-existing data and knowledge bases can derive hypotheses regarding the association of chemicals to gene activity and disease from multiple datasets in a scalable manner. Gohlke et al. have proposed an approach to predict environmental chemicals associated with phenotypes, also using knowledge from the CTD [95]. Their method utilizes the Genetic Association Database (GAD) [96] to associate phenotypes to genetic pathways and the CTD to link pathways to environmental factors. This method has proved its utility, allowing for production of hypotheses for chemicals associated with diseases categorized as metabolic or neuropsychiatric disorders. However, in its current configuration, their method is dependent on the GAD, which contains statically annotated phenotypes in relation to genes containing variants; such DNA changes are not likely to be reflective of molecular profiles of tissues suspected of environmental influence. Unlike this method, our proposed approach is tissue- and data-driven in that the phenotype is determined by the individual measurements of gene expression in cells and tissues, allowing for the dynamic capture of phenotypes.

The approach we propose here is agnostic to experiment protocol, such as cell line or chemical agent tested, and provides for a less resource-intensive screening of chemicals to biologically validate. Our methodology essentially combines the best features of these current approaches. We start by compiling “chemical signatures” in a scalable way using the CTD. As the CTD is a hand-curated collection of chemical-response data, a chemical signature can theoretically be constructed from primary data of individual chemical expression experiments in GEO. These chemical signatures capture known changes in gene expression secondary to hundreds of environmental chemicals. The representation of these gene expression states for all of these chemicals we dub the “Envirome Map”, introduced in Chapter 1 (Figure 5D). In the following, we describe how to merge the Envirome Map with disease states (Figure 5D-F).

In a manner similar to how Gene Ontology categories are tested for over-representation, we then calculate the genes differentially expressed in disease-related experiments and determine which chemical signatures are significantly over-represented. We first verified the accuracy of our methodology by analyzing microarray data of samples with known chemical exposure. After these verification studies yielded positive results, we then applied the method to predict disease-chemical associations in breast, lung, and prostate cancer datasets. We validated some of these predictions with curated disease-chemical relations, warranting further study regarding pathogenesis and biological mechanism in context of environmental exposure. Our method appears to be a promising and scalable way to use existing datasets to connect genome-wide toxicological response to disease [61].

METHOD TO PREDICT ENVIRONMENTAL ASSOCIATION TO GENE EXPRESSION RESPONSE

The Comparative Toxicogenomics Database (CTD) includes manually curated, cross-species relations between chemicals and genes, proteins, and mRNA transcripts [97]. In January 2009, we downloaded the knowledge base, which spanned 4,078 chemicals, 15,461 genes, and 85,937 relationships between them. An example of a relationship in the CTD is “Chemical TCDD results in higher expression of CYP1A1 mRNA as cited by Anwar-Mohamed et al. in H. sapiens” (demonstrated in Figure 6A). The median, 70th percentile, and 75th percentile of the number of genes related to a chemical are 2, 5, and 7, respectively.

With the single-gene, single-chemical relationships, we created “chemical signatures”, or gene sets associated with each chemical (Figure 6B). Gene sets were created from gene-expression relations spanning 249 species, but most relations came from H. sapiens, M. musculus, R. norvegicus, and D. rerio. We eliminated chemical-gene sets that had fewer than 5 genes in the set. This step yielded a total of 1,338 chemical-gene sets.
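The grouping and filtering step above can be sketched as follows. This is a minimal illustration, not the code used in the study: the tuple layout, the function name `build_signatures`, and the toy relations (including the placeholder genes `GENE_X` and `GENE_Y`) are assumptions made for the example and do not reflect the CTD file format.

```python
from collections import defaultdict

# Toy relations: (chemical, gene, number of supporting references).
# Layout and values are illustrative assumptions, not the CTD format.
relations = [
    ("Dioxins", "CYP1A1", 60),
    ("Dioxins", "AHR", 40),
    ("Dioxins", "AHR2", 9),
    ("Dioxins", "GENE_X", 1),   # placeholder gene for the example
    ("Dioxins", "GENE_Y", 1),   # placeholder gene for the example
    ("ToyChemical", "GENE_X", 2),
    ("ToyChemical", "GENE_Y", 1),
]

MIN_GENES = 5  # sets with fewer than 5 genes are eliminated


def build_signatures(relations, min_genes=MIN_GENES):
    """Group single chemical-gene relations into one gene set per chemical,
    then keep only the sets with at least `min_genes` genes."""
    gene_sets = defaultdict(set)
    for chemical, gene, _n_refs in relations:
        gene_sets[chemical].add(gene)
    return {
        chemical: genes
        for chemical, genes in gene_sets.items()
        if len(genes) >= min_genes
    }


signatures = build_signatures(relations)
```

In this toy input only the “Dioxins” set survives the filter, because “ToyChemical” is linked to just two genes.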

We assemble the Envirome Map (Figure 7) by aggregating all 1,338 chemical-gene signatures from the CTD (Figure 6B). Specifically, each signature can be depicted as a vector whereby each entry represents a functional link between a chemical and gene. The Envirome Map is a collection of these vectors. While we present and apply the Map as a matrix of binary associations, it can easily be configured to represent a richer set of relationships, such as ordinal values depicting the scale of association.
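A minimal sketch of the binary matrix form of the Map is shown below, with genes as rows and chemicals as columns. The function name `build_envirome_map` and the toy signatures are assumptions for illustration only.

```python
def build_envirome_map(signatures):
    """Return (genes, chemicals, matrix), where matrix[i][j] is 1 if
    gene i belongs to the signature of chemical j and 0 otherwise."""
    genes = sorted({gene for gene_set in signatures.values() for gene in gene_set})
    chemicals = sorted(signatures)
    matrix = [
        [1 if gene in signatures[chemical] else 0 for chemical in chemicals]
        for gene in genes
    ]
    return genes, chemicals, matrix


# Toy signatures (hypothetical names) standing in for the 1,338 CTD-derived sets.
toy_signatures = {
    "chemical_a": {"GENE1", "GENE2"},
    "chemical_b": {"GENE2", "GENE3"},
}

genes, chemicals, envirome_map = build_envirome_map(toy_signatures)
```

Replacing the 0/1 entries with, for example, the number of supporting references would give the richer, ordinal representation mentioned above.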

The CTD also contains curated data regarding the association of diseases to chemicals. These associations are either shown in an experimental model physiological system or through epidemiological studies. We used these curated associations to validate our predicted factors associated to disease.

There are 3,997 disease-chemical associations in the CTD, consisting of 653 diseases (annotated by unique MeSH terms) and 1,515 chemicals (Figure 6C). The median, 70th, 75th, and 80th percentiles of the number of curated chemicals per disease are 2, 3, 4, and 5, respectively.

[Figure 6: schematic with three panels. A.) Example chemical-gene relation: "Chemical TCDD leads to higher expression of CYP1A1 mRNA in H. sapiens (Anwar-Mohamed et al)". B.) Example gene set for the Dioxins chemical, with 60, 40, and 9 references for the CYP1A1, AHR, and AHR2 interactions. C.) Example chemical set for the "Prostatic Neoplasms" disease (MeSH: D011471), with references associating the disease to sodium arsenite, cadmium, and bisphenol A. In B and C, xi denotes the number of references.]

Figure 6. Creation of the chemical-gene signatures based on the Comparative Toxicogenomics Database (CTD). A.) The CTD contained 85,937 total unique chemical-gene relations over 4,078 chemicals and 15,461 genes. Each relation had one or more citations of support. An example hypothetical relation, “TCDD leads to higher expression of CYP1A1 mRNA in H. sapiens as shown in Anwar-Mohamed et al.”, is seen on the right panel. B.) Creation of chemical-gene set relations. Each chemical-gene relation had a number of citations of support, xi. For each chemical, we constructed a gene set, or “signature”, from the individual chemical-gene relations. We filtered out signatures that had fewer than 5 genes in the set, leaving a total of 1,338 chemical-gene sets. An example of one chemical-gene set (a column of Figure 5D, Figure 7) is seen on the right panel of B: the genes CYP1A1, AHR, and AHR2 are shown to have multiple citations for the relation, 60, 40, and 9 respectively. These signatures in the aggregate form the “Envirome Map” (Figure 7). C.) Representation of the disease-chemical associations in the CTD that are used for validation.

[Figure 7: schematic with two panels, with a legend distinguishing the envirome domain from the genome domain. Left panels show a CTD chemical-gene signature; right panels show the Envirome Map, with genome-wide function (i.e., mRNA expression) as rows and the 1,338 chemicals as columns. A.) General case. B.) Example for Dioxins with CYP1A1 (60), AHR (40), and AHR2 (9).]

Figure 7. Creation of the ‘Envirome Map’ using CTD chemical-gene signatures. A.) Each chemical in the CTD is functionally associated with a set of genes, described earlier (Left panel, see also Figure 6B). This signature can be represented in the ‘Envirome Map’ (Right panel), whereby each column represents the genes (rows) associated with each environmental chemical in binary form. B.) An example representation for the “Dioxins” signature. The entire Envirome Map is populated with 1,338 signatures (columns).

We built a system to test whether genes significantly differentially expressed within a gene expression dataset could be associated with any of the calculated chemical signatures in the Envirome Map (Figure 8A). We conducted two phases of analysis in this study. The first phase was a verification phase, testing whether the method could accurately predict known chemical exposures applied to samples (Figure 8B). The inputs for this first phase were gene expression datasets of chemically exposed samples and unexposed control samples, and the outputs were lists of chemicals predicted to be associated with each dataset. The second phase involved predicting chemicals associated with cancer gene expression datasets (Figure 8C). The inputs for this second phase were gene expression datasets of cancer samples and control samples, and the outputs were lists of chemicals predicted to be associated with each dataset. We attempted to validate these findings further by using curated disease-chemical relations (Figure 8D). Finally, we attempted to group our chemical predictions associated with each cancer dataset by PubChem-derived BioActivity similarity measures, seeking further evidence of potential underlying mechanisms or similar modes of action between chemicals.

[Figure 8: schematic with four panels. A.) Envirome Map of 1,338 chemical-gene sets derived from the CTD. B.) Chemically perturbed microarray data (exposed vs. non-exposed): estradiol (2 datasets), TCDD, zinc, bisphenol A, and vitamin D. C.) Disease microarray data (disease vs. non-disease): prostate, lung, and breast cancers. D.) CTD-derived disease-chemical relations. In B and C, Significance Analysis of Microarrays, Homologene mapping, and the hypergeometric test yield predictions ranked by p-value and q-value. Accuracy in B is the rank of the correctly identified chemical; literature validation in C is proof of disease association among highly ranked chemicals.]

Figure 8. Predicting environmental chemical association to gene expression datasets. A.) A representation of the 1,338 chemical-gene sets in the Envirome Map. B.) For the verification step, we conducted SAM to find genes whose expression was altered in each of our datasets. We then mapped the differentially expressed genes to corresponding extra-species genes in our database by using Homologene. For each chemical-gene set signature in the Map, we conducted a hypergeometric test for enrichment and ranked each result by p-value. C.) We applied the approach used in B to predict chemical association to prostate, breast, and lung cancer data and validated these results with curated disease-chemical annotations from the CTD represented in D. D.) Representation of the curated disease-chemical associations in the CTD.

We used Significance Analysis of Microarrays (SAM) software to select differentially expressed genes from a microarray experiment [98]. The FDR for SAM was controlled at a maximum of 5 to 7% for all of our predictions in order to reduce false associations.

We mapped microarray annotations to the corresponding genes in other representative species, H. sapiens, M. musculus, and R. norvegicus, using Homologene [99]. In the CTD, gene identifiers were commonly associated with H. sapiens; however, some are mapped to specific organisms, such as M. musculus and R. norvegicus. Most mappings in the CTD are among these 3 organisms. By mapping our expression annotation to these organisms, we ensured gene compatibility with a large portion of the CTD.
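The cross-species mapping step can be illustrated with a small lookup over homology groups. This is only a sketch: the group identifiers, the dictionary layout, and the helper names `build_lookup` and `map_genes` are hypothetical and are not the Homologene file format or API.

```python
# Hypothetical homology groups: group id -> {species: gene symbol}.
homology_groups = {
    "group_1": {"H. sapiens": "CYP1A1", "M. musculus": "Cyp1a1", "R. norvegicus": "Cyp1a1"},
    "group_2": {"H. sapiens": "AHR", "M. musculus": "Ahr"},
}


def build_lookup(groups):
    """Index every (species, gene) pair by the homology group it belongs to."""
    lookup = {}
    for members in groups.values():
        for species, gene in members.items():
            lookup[(species, gene)] = members
    return lookup


def map_genes(genes, source_species, target_species, lookup):
    """Map genes from one species to their homologs in another.
    Genes without a homolog in the target species are dropped."""
    mapped = {}
    for gene in genes:
        members = lookup.get((source_species, gene))
        if members and target_species in members:
            mapped[gene] = members[target_species]
    return mapped


lookup = build_lookup(homology_groups)
mouse_to_human = map_genes(["Cyp1a1", "Ahr", "Unknown1"], "M. musculus", "H. sapiens", lookup)
```

Dropping unmapped genes, as done here, is one possible design choice; it keeps the downstream enrichment test restricted to genes that can be compared with the chemical-gene sets.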

We checked for enrichment of differentially expressed genes among each of the 1,338 chemical-gene sets in the Envirome Map with the hypergeometric test. To account for multiple hypothesis testing, we computed the q-value, or false discovery rate for a given p-value, by using 100 random resamplings of genes from the microarray experiment and testing each of these random resamplings for enrichment against each of the 1,338 chemical-gene sets. This methodology is similar to the q-value estimation method described in “GoMiner”, a gene ontology enrichment assessment tool [100]. We called a prediction positive if its p-value and q-value fell below chosen thresholds in our list of 1,338 tested associations. All analyses were conducted using the R statistical environment [101].
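The enrichment test and resampling-based q-value can be sketched as follows. The study itself used R; this Python version is only an illustration under stated assumptions. In particular, the q-value here is estimated as the mean number of null p-values at or below the observed p-value, divided by the number of observed p-values at or below it, which is one common resampling scheme and may differ in detail from the original implementation. All function names and the toy data are hypothetical.

```python
import random
from math import comb


def hypergeom_pvalue(n_universe, n_in_set, n_drawn, n_overlap):
    """Upper-tail probability P(X >= n_overlap) when n_drawn genes are drawn
    without replacement from n_universe genes, n_in_set of which are in the set."""
    total = comb(n_universe, n_drawn)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_drawn - k)
        for k in range(n_overlap, min(n_in_set, n_drawn) + 1)
    )
    return tail / total


def enrichment(de_genes, universe, signatures):
    """Hypergeometric p-value of each chemical-gene set for one gene list."""
    universe = set(universe)
    de = set(de_genes) & universe
    pvalues = {}
    for chemical, genes in signatures.items():
        in_set = set(genes) & universe
        pvalues[chemical] = hypergeom_pvalue(
            len(universe), len(in_set), len(de), len(de & in_set)
        )
    return pvalues


def empirical_q_values(de_genes, universe, signatures, n_resamples=100, seed=0):
    """Estimate a q-value per chemical from random gene lists of the same size."""
    rng = random.Random(seed)
    pool = sorted(set(universe))
    observed = enrichment(de_genes, pool, signatures)
    n_de = len(set(de_genes) & set(pool))
    null_runs = [
        list(enrichment(rng.sample(pool, n_de), pool, signatures).values())
        for _ in range(n_resamples)
    ]
    q_values = {}
    for chemical, p in observed.items():
        n_observed = sum(1 for v in observed.values() if v <= p)
        mean_null = sum(sum(1 for v in run if v <= p) for run in null_runs) / n_resamples
        q_values[chemical] = min(1.0, mean_null / n_observed)
    return observed, q_values


# Toy example: 200 measured genes, two hypothetical chemical-gene sets.
universe = [f"g{i}" for i in range(200)]
toy_signatures = {"chem_hit": set(universe[:10]), "chem_miss": set(universe[100:110])}
de_genes = universe[:8] + universe[150:152]
observed, q_values = empirical_q_values(de_genes, universe, toy_signatures)
```

In the toy example, 8 of the 10 differentially expressed genes fall in the “chem_hit” set, so that set receives a very small p-value and q-value, while “chem_miss” has no overlap and a p-value of 1.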

Method Verification Phase

For our verification phase, we surveyed publicly available data from the Gene Expression Omnibus (GEO) for experiments in which sets of samples exposed to chemicals were compared with controls. We found and used six datasets in the verification phase. Set 1 was GSE5145 (3 study samples and 3 controls), in which H. sapiens muscle cell samples were exposed to Vitamin D [102]. Set 2 was GSE10082 (6 study samples and 5 controls), in which wild-type M. musculus were exposed to tetrachlorodibenzodioxin (TCDD) [103]. Set 3 was GSE17624, in which H. sapiens Ishikawa cells (4 study samples and 4 controls) were exposed to high doses of bisphenol A (no reference). Set 4 was GSE2111, in which H. sapiens bronchial tissue (4 study samples and 4 controls) was exposed to zinc sulfate [104]. The CTD had some chemical-gene relations based on this dataset; we removed these relations prior to computing the predictions for this dataset. Set 5 was GSE2889, in which M. musculus thymus tissues (2 study samples and 2 controls) were exposed to estradiol [105]. Finally, set 6 was GSE11352, in which the H. sapiens MCF-7 cell line was exposed to estradiol at 3 different time points [106]. In all cases except for set 6, we ran the SAM analysis as an unpaired t-test; for set 6, we used the time-course option in SAM. See Table 2 for the number of differentially expressed genes found for each dataset along with their median false discovery rate.

Dataset          Chemical Tested   Number of Samples/Controls (tissue type)   SAM: median FDR   Number of Differentially Expressed Genes / Total
GSE5145 [102]    Vitamin D3        3/3 (H. sapiens muscle)                    0.04              805/20555
GSE10082 [103]   TCDD              6/5 (M. musculus injection)                0.05              2066/21863
GSE17624         Bisphenol A       4/4 (H. sapiens Ishikawa cells)*           0.04              8406/20828
GSE2111 [104]    Zinc sulfate      4/4 (H. sapiens bronchial tissue)          0.05              31/13306
GSE2889 [105]    Estradiol         (M. musculus thymus)                       0.07              112/13383
GSE11352 [106]   Estradiol         (H. sapiens MCF7)                          0.05              114/20555

Table 2. Gene expression dataset summary for verification stage. The 1st column denotes the GEO accession, and the 2nd column is the chemical to which the samples were exposed. The 4th column is the median FDR for SAM. * denotes “high” dosage of Bisphenol A used for the exposed sample group.


Predicting Environmental Factors Associated with Disease-related Gene Expression Data Sets: Prostate, Lung, and Breast Cancer

We searched for previously measured cancer gene expression datasets to identify potential environmental associations with cancer. We used measurements from human prostate cancer from GSE6919 [107, 108], lung cancer from GSE10072 [109], and breast cancer from GSE6883 [110]. We conducted all SAM analyses using an unpaired t-test between disease and control samples. See Table 2 for the number of differentially expressed genes measured for each dataset along with the level of FDR control.

We deliberately chose cancer datasets that used a different population of controls rather than normal tissues from the same patients. The prostate cancer dataset (GSE6919) consisted of 65 prostate tissue cancer samples and 17 normal prostate tissue samples as controls.

The lung cancer dataset (GSE10072) consisted of two patient groups: non-smokers (historically and currently) with cancer, and current smokers with cancer. We conducted the predictions on these groups separately. The cancer non-smoker group consisted of 16 samples and the cancer smoker group had 24 samples. The control group consisted of 15 samples.

The breast cancer dataset (GSE6883) consisted of two distinct cancer sub-groups: non-tumorigenic and tumorigenic. As with the lung cancer data, we conducted our predictions on these groups separately. The non-tumorigenic group consisted of three samples and the tumorigenic group had six samples. The control group contained three samples.

We then validated our highly ranked factor predictions with disease-chemical knowledge from the CTD. In particular, we determined whether the highly significant chemicals in our prediction list included those that had a curated relationship with cancer in the CTD (disease-chemical relation). This step was similar to measuring association to chemicals via enriched gene sets using the hypergeometric test as described above. We used curated factors associated with Prostatic Neoplasms (MeSH ID: D011471), Lung Neoplasms (D008175), and Breast Neoplasms (D001943) to validate our predictions generated with the prostate cancer, lung cancer, and breast cancer datasets, respectively. Further, we assessed the validation by computing the actual number of false positives and true negatives. To compute this number, we assessed whether the prediction list was enriched for chemicals associated with any of the other diseases in the CTD at a higher significance level than the true disease; for this test, we chose diseases that had at least 5 chemical associations, a total of 141 diseases. As an example, to assess the false positive rate for the prostate cancer (MeSH ID: D011471) predictions, we determined the curated enrichment of our predictions for all 140 other disease-chemical sets and counted the number of diseases that had a lower p-value than that computed for D011471.
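The specificity check described above, counting how many other diseases are more strongly enriched than the true disease, can be sketched as follows. The function names, the toy disease-chemical sets, and the chemical identifiers are hypothetical; only the hypergeometric test and the minimum of 5 curated chemicals come from the text.

```python
from math import comb


def hypergeom_pvalue(n_universe, n_in_set, n_drawn, n_overlap):
    """Upper-tail hypergeometric probability P(X >= n_overlap)."""
    total = comb(n_universe, n_drawn)
    tail = sum(
        comb(n_in_set, k) * comb(n_universe - n_in_set, n_drawn - k)
        for k in range(n_overlap, min(n_in_set, n_drawn) + 1)
    )
    return tail / total


def count_more_enriched(predicted, all_chemicals, disease_chemicals, true_disease,
                        min_chemicals=5):
    """Count diseases (other than true_disease) whose curated chemical set is
    enriched in the prediction list with a lower p-value than the true disease.
    Assumes true_disease itself has at least min_chemicals curated chemicals."""
    universe = set(all_chemicals)
    predicted = set(predicted) & universe
    pvalues = {}
    for disease, chemicals in disease_chemicals.items():
        curated = set(chemicals) & universe
        if len(curated) < min_chemicals:
            continue  # too few curated chemicals to test
        pvalues[disease] = hypergeom_pvalue(
            len(universe), len(curated), len(predicted), len(predicted & curated)
        )
    reference = pvalues[true_disease]
    return sum(1 for d, p in pvalues.items() if d != true_disease and p < reference)


# Toy example with 50 tested chemicals and 10 positive predictions.
all_chemicals = [f"c{i}" for i in range(50)]
predicted = all_chemicals[:10]
disease_chemicals = {
    "target": all_chemicals[0:6],     # all 6 curated chemicals were predicted
    "other_a": all_chemicals[20:26],  # no overlap with predictions
    "other_b": all_chemicals[8:14],   # 2 of 6 curated chemicals were predicted
    "small": all_chemicals[0:2],      # fewer than 5 chemicals, excluded
}
n_false = count_more_enriched(predicted, all_chemicals, disease_chemicals, "target")
```

A count of zero, as in this toy example, would indicate that no other disease explains the prediction list better than the disease of interest.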

Clustering Significant Predictions By PubChem-derived Biological Activity

Chemical-gene sets derived from the CTD are but one representation of how a chemical might affect biological activity. Biological activity of chemicals may also be derived from high-throughput, in-vitro chemical screens such as those archived in PubChem [87, 111]. Specifically, the PubChem database provides a large number of phenotypic measurements (or “BioAssays”) for many of the chemicals we predicted for cancer. In addition, PubChem provides tools to compare BioAssay measurements for different chemicals. Quantitative and standardized BioAssay measurements (normalized “scores”) allow comparison of biological activities of chemicals and derivation of biological activity similarity between chemicals. For example, PubChem represents the biological activity of a compound through a vector of BioAssay scores and assembles a bioactivity similarity matrix between each pair of chemicals with this data.

We sought further external evidence of the relevance of the predicted chemicals through comparison of their patterns of PubChem-sourced biological activity (Figure 9). First, we produced a list of chemical predictions for each cancer dataset as described above (Figure 8, Figure 9A, Figure 9B) and submitted our list of chemicals to PubChem for activity comparison (Figure 9). Finally, we examined how the PubChem-derived biological activities of the compounds correlated with their chemical-gene set association significance by clustering the chemicals in the prediction list by their biological activity.
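The comparison of bioactivity profiles can be illustrated with a small sketch that computes a correlation-based similarity matrix from vectors of assay scores and groups chemicals whose similarity exceeds a cutoff. This is not the PubChem tool itself: the use of Pearson correlation, the threshold-based grouping (a single-linkage cut, used here as a stand-in for the clustering), the cutoff of 0.8, and the toy scores are all assumptions for illustration.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    if var_x == 0 or var_y == 0:
        return 0.0  # constant vector: correlation undefined, treated as 0
    return cov / sqrt(var_x * var_y)


def similarity_matrix(scores):
    """Pairwise similarity between chemicals from their assay score vectors."""
    names = sorted(scores)
    matrix = [[pearson(scores[a], scores[b]) for b in names] for a in names]
    return names, matrix


def threshold_clusters(names, matrix, threshold=0.8):
    """Group chemicals connected by similarity >= threshold (single linkage)."""
    parent = list(range(len(names)))

    def find(i):
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(len(names)):
        for j in range(i + 1, len(names)):
            if matrix[i][j] >= threshold:
                parent[find(i)] = find(j)
    groups = {}
    for i, name in enumerate(names):
        groups.setdefault(find(i), []).append(name)
    return sorted(sorted(group) for group in groups.values())


# Toy assay scores (hypothetical) for three chemicals across four assays.
toy_scores = {
    "chem_a": [1.0, 2.0, 3.0, 4.0],
    "chem_b": [2.0, 4.0, 6.0, 8.5],
    "chem_c": [4.0, 3.0, 2.0, 1.0],
}
names, matrix = similarity_matrix(toy_scores)
clusters = threshold_clusters(names, matrix)
```

In the toy data, “chem_a” and “chem_b” have nearly proportional score profiles and fall into one group, while “chem_c” has the opposite pattern and forms its own group.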

[Figure 9: schematic with three panels. A.) Envirome Map of 1,338 chemical-gene sets. B.) Disease microarray data (disease vs. non-disease; prostate, lung, and breast cancers) yielding predictions ranked by p-value and q-value, from which significant chemicals are selected and clustered by BioActivity similarity. C.) PubChem BioActivity score data (790 bioassays for 66 chemicals) and the resulting 66 x 66 PubChem BioActivity similarity matrix (correlation of bioassay scores).]

Figure 9. Clustering chemical prediction lists by biological activity archived in PubChem. A.) A representation of the CTD-based Envirome Map as shown in detail in Figure 6. B.) Prediction of the chemicals associated to each cancer dataset using chemical-gene sets from the CTD. We selected highly significant chemical predictions for each cancer and clustered these chemicals by their “Bioactivity” similarity as defined and computed in PubChem. C.) Within PubChem, each of these chemicals has a vector of standardized BioAssay scores. PubChem had 790 BioAssay scores for 66 of our significant predictions. The PubChem BioActivity similarity tool uses these vectors of scores to compute the biological activity similarity for each pair of chemicals, and similarity is represented as a matrix.

RESULTS

We implemented a method to predict a list of environmental factors associated with differentially expressed genes (Figure 8). The method is centered on creation of the Envirome Map (Figure 5D-F, Figure 7), an aggregation of chemical-gene sets that are derived from single curated chemical-gene response relationships in the CTD (Figure 6). We determine whether the differentially expressed genes are associated with a chemical by using the hypergeometric test to assess whether they are enriched for a chemical-gene set, that is, whether they contain more genes from the chemical-gene set than expected at random. We applied this method in two phases: a verification phase, in which we sought to rediscover known exposures applied to samples, and a query phase, in which we sought to find factors associated with cancer gene expression datasets. We refer to significant chemical-gene set associations to gene expression data as “associations” or “predictions” in the following.

Verification Phase

We first applied our method to gene expression data from experiments in which samples were exposed to specific chemicals, reasoning that if our method could identify these known chemical exposures, we could use the method to predict chemicals that may have perturbed gene expression in unknown experimental or disease conditions. Our goal was to determine where a gene expression-altering chemical might lie in the range of significance rankings applied by the prediction method.

We applied our method on datasets that measured gene expression after exposure to vitamin D, tetrachlorodibenzodioxin (TCDD), bisphenol A, zinc, and estradiol (2 datasets) on different tissue types. Table 3 shows the results of our predictions along with a subset of genes in the chemical-gene set that were differentially expressed.


Actual Chemical Exposure (GEO accession)                      Chemicals Predicted   Hypergeometric P-value   Rank (Percentile)   q-value   Relevant Genes Expressed
Vitamin D3 on H. sapiens muscle cells (GSE5145)               Calcitriol            1x10^-23                 1 (100)             ~0        VDR (25), CYP24A1 (14)
TCDD on M. musculus (GSE10082)                                TCDD                  2x10^-15                 3 (99)              ~0        CYP1A1 (59), CYP1B1 (15), AHRR (6), CYP1A2 (14)
Bisphenol A on H. sapiens Ishikawa cells (GSE17624)           Bisphenol A           1x10^-6                  15 (99)             ~0        ESR1 (31), ESR2 (7), S100G (6)
Zinc sulfate on H. sapiens bronchial tissue (GSE2111)         Zinc sulfate          3x10^-3                  15 (99)             0.04      SLC30A1 (3), MT1F (2), MT1G (2)
Estradiol on M. musculus thymus (GSE2889)                     Estradiol             5x10^-3                  17 (99)             0.08      C3 (6), LPL (4), CTSB (2)
Estradiol on H. sapiens MCF7 cell line (GSE11352)             Estradiol             5x10^-3                  19 (99)             0.08      ISG20 (2), MGP (2), SERPINA1 (2)

Table 3. Chemical Prediction Results from the Verification Phase. Each row represents a gene expression dataset and relevant prediction and ranking. The 1st column specifies the actual exposure applied to the samples and the GEO accession of the gene expression dataset; the 2nd column is the chemical predicted. The 3rd and 4th columns represent the hypergeometric p-value for chemical-gene set enrichment along with the rank of the chemical in the prediction list. The 5th column shows the q-value derived from 100 random samplings of genes from the gene expression dataset. The 6th column shows notable genes expressed in the chemical-gene set along with the number of references for the chemical-gene relation in the CTD.

We were able to satisfactorily predict the exposures applied to the gene expression datasets. We called a prediction positive if the exposure had a relatively high ranking (low p-value for enrichment) and if the q-value was lower than 0.1. For the dataset measuring expression after exposure to vitamin D, calcitriol, a form of vitamin D, was ranked first in the list (p=10^-23, q~=0). Similarly, TCDD was ranked third in its respective list (p=10^-15, q~=0). The other exposures ranked within the top percentile, at ranks 15 to 19, with p-values between 10^-6 and 0.01 and q-values less than 0.1. We reasoned that we could detect true associations between environmental chemicals and gene expression phenotypes provided they met these significance thresholds.
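The ranking step used in the verification phase can be illustrated with a short sketch. It assumes a one-sided hypergeometric test of the overlap between each chemical-gene set and the differentially expressed genes; the function names, the gene universe, and the toy gene lists are hypothetical and are not taken from the analysis reported here.

```python
from math import comb

def hypergeom_enrichment_p(universe_size, set_size, n_diff, overlap):
    """P(X >= overlap) when n_diff genes are drawn without replacement from
    universe_size genes, set_size of which belong to the chemical-gene set."""
    upper = min(set_size, n_diff)
    tail = sum(
        comb(set_size, k) * comb(universe_size - set_size, n_diff - k)
        for k in range(overlap, upper + 1)
    )
    return tail / comb(universe_size, n_diff)

def rank_chemicals(chemical_gene_sets, diff_genes, universe):
    """Return (chemical, p-value) pairs sorted from most to least enriched."""
    universe = set(universe)
    diff = set(diff_genes) & universe
    results = []
    for chemical, genes in chemical_gene_sets.items():
        gene_set = set(genes) & universe
        p = hypergeom_enrichment_p(
            len(universe), len(gene_set), len(diff), len(gene_set & diff)
        )
        results.append((chemical, p))
    return sorted(results, key=lambda pair: pair[1])

# Toy data (hypothetical): 100 measured genes, 10 differentially expressed.
universe = [f"G{i}" for i in range(100)]
diff_genes = [f"G{i}" for i in range(10)]
chemical_gene_sets = {
    "chemical_A": [f"G{i}" for i in range(5)],       # all 5 genes perturbed
    "chemical_B": [f"G{i}" for i in range(50, 55)],  # no genes perturbed
}
ranked = rank_chemicals(chemical_gene_sets, diff_genes, universe)
for chemical, p in ranked:
    print(chemical, p)
```

In this toy example the chemical whose gene set is fully contained in the differentially expressed genes ranks first, mirroring how a known exposure is expected to rise to the top of the prediction list.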

Predicting Environmental Chemicals Associated with Cancer Data Sets

We applied our prediction method to cancer disease states, specifically merging the Envirome Map with prostate, breast, and lung cancer datasets. In particular, we computed predictions for prostate cancer from primary prostate tumor tissue, lung adenocarcinomas from lung tissue from non-smoking individuals, and non-tumorigenic breast cancer cells grown in mouse xenografts. To validate and select specific predictions from our ranked list of 1,338 environmental chemicals from the Envirome Map, we measured how enriched top-ranking chemicals were for annotated disease-chemical citations in the CTD for the diseases of interest (“Prostatic Neoplasms”, “Breast Neoplasms”, and “Lung Neoplasms”). To call a positive chemical association or prediction to a disease phenotype, we used p-value thresholds similar to those we observed during the verification phase (α ≤ 10^-4, 0.001, 0.01) along with q-values as low as possible, specifically less than 0.1. For comparison, we also used the typical p-value threshold of 0.05.

Figure 10, Figure 11, and Figure 12 show the results of the disease validation phase. In all cases, the significant chemicals contained many of the specific curated disease-chemical relations. For example, if we call chemicals with p-values less than 0.01 positive predictions, then we were able to capture 18%, 16%, and 7% of all of the curated relationships for prostate, lung, and breast cancers respectively (p=10^-7, 10^-4, and 4x10^-5). We assessed the specificity of our list by computing how many curated chemicals we found for all other diseases in the CTD (Figure 10, Figure 11, and Figure 12, offset points in orange and black). We observed false positive rates of 1 to 4% for prostate cancer, 8 to 20% for lung cancer, and 2 to 10% for breast cancer. However, almost all of the “false positives” were other types of neoplasms or cancers (Figure 10, Figure 11, and Figure 12, examples annotated in italics/arrows). For example, for the lung and prostate cancer predictions at α=0.001, only 1 disease other than a neoplasm or carcinoma was detected: Liver Cirrhosis, Experimental (MeSH ID: MESH:D008325).
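The validation logic can be sketched as follows, under stated assumptions: enrichment of curated chemicals in a prediction list is tested with a one-sided hypergeometric test, and the false positive rate is taken as the fraction of other diseases whose curated chemicals are also enriched below a cutoff. The cutoff `alpha`, the function names, and the toy disease labels are hypothetical; the text does not state the exact cutoff used to flag a false positive.

```python
from math import comb

def enrichment_p(n_total, n_curated, n_predicted, n_hit):
    """One-sided hypergeometric P(X >= n_hit) for curated chemicals
    appearing in a prediction list of size n_predicted."""
    upper = min(n_curated, n_predicted)
    tail = sum(
        comb(n_curated, k) * comb(n_total - n_curated, n_predicted - k)
        for k in range(n_hit, upper + 1)
    )
    return tail / comb(n_total, n_predicted)

def validate(predicted, curated_by_disease, all_chemicals, target, alpha=0.05):
    """Summarize curated-relation capture per disease and estimate the
    false positive rate over all diseases other than the target."""
    predicted = set(predicted)
    summary = {}
    for disease, curated in curated_by_disease.items():
        curated = set(curated)  # assumed non-empty for every disease
        hits = predicted & curated
        summary[disease] = {
            "hits": len(hits),
            "fraction_captured": len(hits) / len(curated),
            "p": enrichment_p(
                len(all_chemicals), len(curated), len(predicted), len(hits)
            ),
        }
    others = [d for d in summary if d != target]
    false_positives = [d for d in others if summary[d]["p"] < alpha]
    fpr = len(false_positives) / len(others) if others else 0.0
    return summary, fpr

# Toy data (hypothetical): 100 chemicals, 10 of them predicted.
all_chemicals = [f"c{i}" for i in range(100)]
predicted = [f"c{i}" for i in range(10)]
curated_by_disease = {
    "disease_of_interest": [f"c{i}" for i in range(5)] + [f"c{i}" for i in range(50, 55)],
    "other_disease_A": [f"c{i}" for i in range(60, 70)],
    "other_disease_B": [f"c{i}" for i in range(5, 10)] + ["c70"],
}
summary, fpr = validate(predicted, curated_by_disease, all_chemicals,
                        target="disease_of_interest")
print(summary["disease_of_interest"], fpr)
```

Here 5 of the 10 curated chemicals for the disease of interest are captured, and one of the two other diseases is also enriched, giving a toy false positive rate of 0.5.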

[Figure 10 plot. x-axis: -log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (89), 2 (68), 3 (50), 4 (27). y-axis: -log10(p-value) of disease enrichment. Legend: True Negatives, False Positives, Prostatic Neoplasms. Prostatic Neoplasms points are labeled 13 (19%), 12 (18%), 10 (15%), and 7 (10%). Annotated false positives: Carcinoma, Hepatocellular 10 (30%); Liver Neoplasms 10 (29%); Liver Cirrhosis 10 (16%). Bracketed false positive rates: 1%, 2%, 3%, 4%.]

Figure 10. Curated disease-chemical enrichment versus prediction lists for prostate cancer datasets. For a prediction list, we selected chemicals that ranked within α=10^-4, 10^-3, 10^-2, and 0.05. This -log10(threshold) along with the number of total chemicals found (in parentheses) for each threshold is seen on the x-axis of each figure. We tested if these highly ranked chemicals found under each threshold were enriched for chemicals that had known curated association with the cancer in question. The -log10(p-value) for this enrichment is seen on the y-axis. The solid round red marker represents the enrichment test for the actual disease for which the predictions were based; the number underneath represents the total number of chemicals found in the prediction list that had a curated association with the disease and the percent found among all curated relations for that disease. We estimated accuracy and precision by computing disease-chemical enrichment for all other diseases; false positives are offset in black and true negatives are in yellow. The false positive rate is bracketed and in italics. Examples of false positives are annotated in blue italics along with the number of chemicals found in the prediction list corresponding to that disease and the percent found among all curated relations for that disease.

[Figure 11 plot. x-axis: -log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (84), 2 (73), 3 (42), 4 (29). y-axis: -log10(p-value) of disease enrichment. Legend: True Negatives, False Positives, Lung Neoplasms. Annotated false positives include Carcinoma, Hepatocellular; Prostatic Neoplasms; Mammary Neoplasms, Experimental; and Liver Cirrhosis. Bracketed false positive rates: 8%, 9%, 10%, 20%.]

Figure 11. Curated disease-chemical enrichment versus prediction lists for lung cancer datasets. See Figure 10 for complete legend.

[Figure 12 plot. x-axis: -log10(p-value) threshold for factor ranking (number of chemicals found): 1.3 (86), 2 (28), 3 (11), 4 (5). y-axis: -log10(p-value) of disease enrichment. Legend: True Negatives, False Positives, Breast Neoplasms. Annotated false positives include Prostatic Neoplasms 7 (11%); Carcinoma, Hepatocellular 5 (33%); and Skin Neoplasms 4 (19%). Bracketed false positive rates: 2%, 4%, 10%.]

Figure 12. Curated disease-chemical enrichment versus prediction lists for breast cancer datasets. See Figure 10 for complete legend.

For the prostate cancer dataset, we chose a chemical signature association threshold of 0.001 (q ≤ 0.01). Of 1,338 chemicals tested, 50 were found under this threshold. Of these 50 predicted chemicals, 10 had a curated relation with the MeSH term “Prostatic Neoplasms”. This amounted to prediction of 15% of all CTD curated disease-chemical relations for the Prostatic Neoplasms term (p = 3x10^-7). These chemicals are seen in Table 4 and include estradiol, sodium arsenite, cadmium, and bisphenol A. Also predicted were known therapeutics, including raloxifene, doxorubicin, genistein, diethylstilbestrol, fenretinide, and zinc. First, we observed that many of the genes detected were well studied, lending additional support to our predictions. For example, ESR2, PGR, and MAPK1 had 37, 34, and 14 references respectively citing their activity in the context of estradiol exposure (Table 4, second column from the right). Second, we observed common occurrence of genes such as ESR2, BCL2, and MAPK1 among some of the gene sets associated with chemicals such as estradiol, raloxifene, sodium arsenite, doxorubicin, diethylstilbestrol, and genistein.


| Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations |
|---|---|---|---|---|---|
| Estradiol | 4x10^-10 | 5 (99) | ~0 | ESR2(37), PGR(34), MAPK1(14) | [112] |
| Raloxifene | 1x10^-9 | 6 (99) | ~0 | ESR2(6), IGF1(5), BCL2(4) | [113] |
| Sodium arsenite | 1x10^-8 | 8 (99) | ~0 | JUN(13), MAPK1(9), CCND1(8), FOS(6) | [114] |
| Doxorubicin | 7x10^-7 | 11 (99) | ~0 | BCL2(23), MAPK1(14), TNF(10) | [115-118] |
| Cadmium | 6x10^-6 | 13 (99) | ~0 | MT2A(14), MT1A(12), MT3(11), MT1(6) | [119] |
| Genistein | 3x10^-5 | 19 (99) | 6x10^-4 | ESR2(22), PGR(10), MAPK1(5) | [120-122] |
| Diethylstilbestrol | 3x10^-5 | 22 (98) | 0.001 | ESR2(8), FOS(8), HOXA10(4) | [123, 124] |
| Fenretinide | 3x10^-4 | 40 (97) | 0.004 | BCL2(3), ELF3(2), LDHA(2) | [125] |
| Bisphenol A | 6x10^-4 | 47 (96) | 0.01 | PGR(8), ESR2(7), IL4RA(2) | [112] |
| Zinc | 9x10^-4 | 53 (96) | 0.01 | MT3(18), MT2A(13), MT1A(11) | [126-129] |

Table 4. Prediction of environmental chemicals associated with prostate cancer samples (GSE6919). Shown in the table are a subset of the highly ranked chemicals (p < 0.001) that were predicted to have association with prostate cancer gene expression and had evidence of association with the MeSH term “Prostatic Neoplasms” as in the CTD. The 1st column represents the chemical predicted and the 2nd and 3rd columns show the hypergeometric p-value and ranking. The 4th column shows q-value derived from random samples of genes. The 5th column shows the notable genes in the chemical-gene set that were differentially expressed. The 6th column contains references for the prostate cancer and chemical association found from the CTD.

For the lung cancer dataset, we also chose a threshold of 0.001 (q ≤ 0.004). Of 1,338 chemicals tested, 42 were found under this threshold. Of these 42 chemicals, 7 had a cited relation with “Lung Neoplasms”, 14% of all curated disease-chemical relations for the term (p = 1x10^-5). These chemicals are seen in Table 5. For lung cancer, we observed cited chemicals such as sodium arsenite, vanadium pentoxide, dimethylnitrosamine, 2-acetylaminofluorene, and asbestos. Therapeutics observed included doxorubicin and indomethacin. Unlike the prostate cancer predictions, we did not observe common genes represented across different chemical-gene sets. Predictions for the smoker lung cancer samples were similar, yielding sodium arsenite, dimethylnitrosamine, and vanadium pentoxide, albeit through different differentially expressed genes.


| Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations |
|---|---|---|---|---|---|
| Doxorubicin | 1x10^-6 | 16 (99) | 4x10^-4 | CASP3(60), ABCB1(28), BAX(26), BCL2(23) | [130] |
| Sodium arsenite | 8x10^-6 | 20 (98) | 4x10^-4 | JUN(13), NQO1(6), EGR1(6) | [131-133] |
| Vanadium pentoxide | 1x10^-5 | 24 (98) | 6x10^-4 | HBEGF(3), CDK7(1), CDKN1B(1), CDKN1C(1) | [134] |
| Dimethylnitrosamine | 6x10^-5 | 27 (98) | 7x10^-4 | TGFB1(23), TIMP1(15), PCNA(6) | [135] |
| Indomethacin | 2x10^-4 | 34 (97) | 0.002 | BIRC5(3), CDKN1B(2), MMP9(2) | [136-138] |
| 2-Acetylaminofluorene | 3x10^-4 | 36 (97) | 0.003 | ABCB1(4), ABCG2(4), KRT19(2) | [139] |
| Asbestos, Serpentine | 4x10^-4 | 39 (97) | 0.004 | IL6(2), MMP9(2), MMP12(2), PDGFB(2) | [140] |

Table 5. Prediction of environmental chemicals associated with lung cancer samples (GSE10072). Shown in the table are subsets of the highly ranked chemicals (p < 0.001) that were predicted to have association with lung cancer gene expression (non-smokers) and had evidence of association with the MeSH term “Lung Neoplasms”.

For the breast cancer dataset, we chose a threshold of 0.01 (q ≤ 0.08). Of 1,338 chemicals tested, 28 were found under this threshold. Of these 28 chemicals, 7 had a cited relation with “Breast Neoplasms”, 7% of all curated disease-chemical relations for the disease (p = 4x10^-5). These chemicals are seen in Table 6. The chemicals predicted included progesterone and bisphenol A. Therapeutics found included indomethacin and cyclophosphamide. For chemicals such as estradiol, genistein, and diethylstilbestrol, there was evidence of both a harmful and a therapeutic role in breast cancer. Unlike the predictions shown for prostate and lung cancer, the genes utilized in the predictions for breast cancer were not as well studied, with 1 to 3 references for the gene and environment association. We observed some commonality in chemical-gene sets, such as the presence of IL6 and CEBPD in several of the top chemicals predicted in association with the disease. Similar chemicals were predicted for the tumorigenic breast cancer dataset, such as estradiol and progesterone. However, chemicals not highly ranked in the non-tumorigenic predictions included benzene and the therapies tamoxifen and resveratrol.

| Chemical Predicted | Hypergeometric P-value | Rank (percentile) | q-value | Relevant genes in set (number of references) | Citations |
|---|---|---|---|---|---|
| Progesterone | 2x10^-4 | 6 (99) | 0.01 | IL6(3), STC1(3), CEBPD(2) | [141, 142] |
| Genistein | 6x10^-4 | 10 (99) | 0.03 | CEBPD(1), APLP2(1), MLF1(1) | [143-145] |
| Estradiol | 7x10^-4 | 11 (99) | 0.03 | LPL(4), IL6(3), CEBPD(2) | [146-150] |
| Indomethacin | 3x10^-3 | 17 (99) | 0.05 | CCDC50(1), BIRC3(1), DNAJB(1) | [151] |
| Diethylstilbestrol | 3x10^-3 | 18 (99) | 0.05 | IL6(1), MARCKS(1), MXD1(1), MMP7(1) | [152, 153] |
| Cyclophosphamide | 4x10^-4 | 19 (99) | 0.06 | IL6(3), MARCKS(1), PSMA5(1) | [154-156] |
| Bisphenol A | 6x10^-3 | 21 (99) | 0.08 | CEBPD(1), MLF1(1), DTL(1) | [157] |

Table 6. Prediction of environmental chemicals associated with breast cancer samples (GSE6883). Shown in the table are subsets of the highly ranked chemicals (p < 0.01) that were predicted to have association with breast cancer gene expression (non-tumorigenic) and had evidence of association with the MeSH term “Breast Neoplasms”.


Some of the chemicals found were common to more than one type of cancer. For example, we predicted sodium arsenite for both prostate and lung cancers, and bisphenol A for both prostate and breast cancers. In some cases, the overlap in predicted chemicals across different cancers is due to the expression of distinct genes in each dataset, highlighting the many possible interactions between environmental chemicals and genes.

Clustering Significant Predictions by PubChem-derived Biological Activity

We have described a method of generating a list of chemical predictions associated with disease-annotated gene expression datasets and applied the method to gene expression data for several cancers, in effect merging a comprehensive representation known as the Envirome Map with disease datasets. We have validated a subset of our predictions with evidence from the literature as described above. We sought further evidence of the biological relevance of our predictions through internal comparison of their potential activity archived in PubChem. Specifically, we expected some degree of correlation between “similar” chemicals and their gene set significance to the cancer datasets. We opted to use PubChem BioActivity to assess chemical similarity, assuming this measure of phenotypic similarity would be representative of underlying biological pathways of action. We picked chemicals that were deemed significant at the thresholds used above (p=0.001, 0.001, and 0.01 for the prostate, lung, and breast cancer datasets, respectively) across all of the cancer datasets. This resulted in a total of 130 chemicals, 66 of which had BioActivity data in PubChem. The BioActivity similarity for each of the 66 chemicals was computed through 790 BioAssay scores. Figure 13 shows the -log10 of significance for the highest ranked chemical predictions clustered by their BioActivity similarity.
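The reordering of chemicals by activity similarity can be illustrated with a small sketch. It is not the PubChem similarity computation: it assumes each chemical has a vector of assay scores, uses 1 minus the Pearson correlation as a stand-in distance, and applies average-linkage agglomerative clustering to obtain a row order for a heatmap. The function names and the toy score vectors are hypothetical.

```python
from itertools import combinations
from math import sqrt

def pearson(x, y):
    """Pearson correlation of two equal-length score vectors."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    cov = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sx = sqrt(sum((a - mx) ** 2 for a in x))
    sy = sqrt(sum((b - my) ** 2 for b in y))
    return cov / (sx * sy) if sx and sy else 0.0

def cluster_order(profiles):
    """Average-linkage agglomerative clustering on 1 - Pearson correlation.
    Returns chemical names in the order of the merged leaves."""
    clusters = [[name] for name in profiles]

    def distance(a, b):
        pairs = [(i, j) for i in a for j in b]
        total = sum(1 - pearson(profiles[i], profiles[j]) for i, j in pairs)
        return total / len(pairs)

    while len(clusters) > 1:
        # Merge the two closest clusters and keep their members adjacent.
        a, b = min(combinations(clusters, 2), key=lambda ab: distance(*ab))
        clusters = [c for c in clusters if c is not a and c is not b]
        clusters.append(a + b)
    return clusters[0]

# Toy assay-score vectors (hypothetical values, 4 assays per chemical).
profiles = {
    "chem_A": [1.0, 2.0, 3.0, 4.0],
    "chem_C": [4.0, 3.0, 2.0, 1.0],
    "chem_B": [2.0, 4.0, 6.0, 8.5],
    "chem_D": [8.0, 6.0, 4.5, 2.0],
}
order = cluster_order(profiles)
print(order)
```

Rows of a significance matrix (one column per cancer dataset) can then be arranged in this order, so that chemicals with similar activity profiles sit next to each other and shared patterns of -log10(p-value) become visible.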

We found that some chemicals with similar biological activity profiles in PubChem had similar patterns of chemical-gene set association across the cancer datasets. For example, sodium arsenite, sodium arsenate, and doxorubicin have closely related biological profiles as well as high significance of chemical-gene set association for the prostate and lung cancer data (Figure 13, enclosed in orange box); however, we did not observe the same pattern for other biologically similar chemicals such as tetrachlorodibenzodioxin. We also observed correlation between the biological activity similarity and chemical-gene set association for hormone or steroidal chemicals such as ethinyl estradiol, estradiol, and diethylstilbestrol as well as progesterone and corticosterone (Figure 13, enclosed in purple boxes).

[Figure 13 heatmap. Color key: -log10(p-value), 0 to 8. Columns: prostate cancer, lung cancer (non-smokers), breast cancer (non-tumorigenic). Rows are ordered by BioActivity similarity: sodium arsenite, sodium arsenate, Doxorubicin, Tetrachlorodibenzodioxin, Thioacetamide, Benzene, Hydrogen Peroxide, Aflatoxin B1, 2-Acetylaminofluorene, nickel chloride, flavopiridol, fulvestrant, Metribolone, alitretinoin, vinclozolin, tert-Butylhydroperoxide, 2-nitrofluorene, Carbon Tetrachloride, benzyloxycarbonylleucyl-leucyl-leucine aldehyde, 4-(N-methyl-N-nitrosamino)-1-(3-pyridyl)-1-butanone, Raloxifene, Methapyrilene, bisphenol A, Etoposide, Tamoxifen, Isotretinoin, Lindane, Mechlorethamine, Cyclosporine, Folic Acid, Dimethylnitrosamine, mono-(2-ethylhexyl)phthalate, Ethanol, Acrolein, Piperonyl Butoxide, 3-dinitrobenzene, naphthalene, Trichloroethylene, Fenthion, indole-3-carbinol, alachlor, Cholecalciferol, Methylnitronitrosoguanidine, Vitamin A, Acetaminophen, Furosemide, hydroquinone, Cadmium Chloride, pyrazole, Am 580, pirinixic acid, Fenretinide, Genistein, Estradiol, Ethinyl Estradiol, Diethylstilbestrol, 4-hydroxytamoxifen, Calcitriol, Mifepristone, Disulfiram, Vitamin K 3, Progesterone, Tretinoin, Corticosterone, resveratrol, Indomethacin.]

Figure 13. Chemical predictions for Prostate, Lung, and Breast Cancer datasets clustered by PubChem BioActivity. Highly significant chemical prediction p-values for the prostate, lung, and breast cancer datasets (p=0.001, 0.001, 0.01 for the prostate, lung, and breast cancer datasets) are reordered by their BioActivity similarity computed by PubChem. A column represents the cancer analyzed and each cell corresponds to the chemical-gene set association -log10(p-value). Examples of correlation between BioActivity similarity and chemical-gene set significance include the sodium arsenite, sodium arsenate, and Doxorubicin cluster (labeled in orange), and the Genistein, Estradiol, Ethinyl Estradiol, and Diethylstilbestrol cluster and the Progesterone, Tretinoin, and Corticosterone cluster (labeled in purple). Other examples of BioActivity similarity and chemical-gene set association include the chemicals vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride (outlined in blue).


DISCUSSION

We have developed a knowledge- and data-driven method to predict chemical associations with gene expression datasets, using publicly available and previously disjoint datasets. Specifically, we have created a functional, gene expression representation of 1,338 environmental chemicals called the Envirome Map (Figure 5D-F) and have developed a quantitative method to query this map. To our knowledge, there are few methods that generate hypotheses regarding environmental associations with disease from gene expression data. Most current approaches in toxicology have focused on a small number of environmental influences on single or small groups of genes, while current approaches in toxicogenomics have concentrated on measuring genome-wide responses for a few chemicals [158]. Our prediction method enables the generation of hypotheses in a more scalable manner using existing data, examining the potential role of hundreds of chemicals over thousands of genome-wide measurements and diseases.

As an example, we predicted sodium arsenite in association with prostate and lung cancers, estrogenic compounds such as bisphenol A and estradiol with prostate and breast cancers, and dimethylnitrosamine with lung cancer. Although each has curated knowledge behind the association in the CTD, the mechanisms of action are not well known and call for further study. So far, Benbrahim-Tallaa et al. have found hypomethylation patterns in the presence of arsenic in prostate cancer cells [114]. Zanesi et al. show a potential interaction between the FHIT gene and dimethylnitrosamine in producing lung cancers [135]. Evidence of a complex mechanistic action of estrogens, such as estradiol, on breast cancer carcinogenesis has been established [159]; however, the role of other estrogen-like compounds has only recently been studied. For example, bisphenol A has been shown to invoke an aggressive response in cancer cell lines [160], possibly by affecting estrogen-dependent pathways [161]. It is evident that more experimentation is required, involving the measurement of exposure-affected proteins and genes and their activation state in cellular models, and of their relation to the chemical signatures.

An overlap of activity of the same genes induced by different chemicals would suggest a common physiological action by these chemicals. For example, the ESR2 and MAPK1 genes in the prostate cancer predictions, and the IL6 and CEBPD genes in the breast cancer predictions, were associated with several chemicals for each of the diseases. We also found an overlap of chemicals among different cancers. This result arises from the correlation in the significant pathways shared by these cancers; however, it may also indicate a need to explore less significant associations in order to find unique and specific gene expression/chemical exposure relationships for a given disease. Furthermore, this result may also indicate a bias in the gene and chemical relationships cataloged in the CTD. For example, it could be that genes specific to common cancer-related pathways, such as BCL2 or ESR2, are those that are well studied.

Related to this, we have attempted to show how biological activity, as assayed in a high-throughput chemical screen in PubChem, can be correlated with chemical-gene set associations. Observing both a correlation in PubChem-derived bioactivity and a chemical-gene set association from the CTD provides a way to identify shared modes of action among groups of similar or related chemicals. These data serve both to provide internal validation for lists of predicted chemicals acting through similar pathways (such as those induced by estrogen) and to prioritize hypotheses. For example, we did not find curated evidence in the CTD for association of the chemicals vinclozolin, tert-Butylhydroperoxide, and Carbon Tetrachloride with prostate or lung cancers; however, their similar bioactivity profiles (Figure 13, enclosed in blue box) and high chemical-gene set association call for further review.

We do acknowledge some arbitrariness in our choice of methods and thresholds; most of these were chosen to show significance in our methodology without adding complexity. We could have chosen any of several alternative approaches to implementing our method; however, predictions made with the Gene Set Enrichment Analysis (GSEA) [162] method during the verification phase were not as sensitive (not shown). Another limitation in our first implementation is that in calculating the chemical signatures associating chemicals with gene sets, we ignored the specific degree of expression change (up or down) encoded in the CTD. We decided not to use this information due to the presence of contradictions (some references may point to an increase of exposure-induced gene expression while another reference might claim the opposite), and other preliminary work suggesting that filtering by the degree of change reduced sensitivity (data not shown). Because of these limitations, direction of association cannot be inferred. Further still, we acknowledge richer and more refined chemical signatures along with further integration with resources like PubChem will need to be built to make the most accurate predictions.

Another issue with querying the microarray data of any experiment is the lack of full sample information to stratify results; for example, different exposures may be associated with a subset of the samples. A related concern is the small sample size of some of the datasets used to evaluate the method. For example, the best predictive power was seen with the largest dataset (prostate cancer, GSE6919), and the worst with one of the smallest (breast cancer, GSE6883). Despite this heterogeneity and lack of power, we still arrived at noteworthy and literature-backed findings warranting further study. We also urge that more evaluation occur with datasets that have a larger number of samples.

Most importantly, we stress that these types of association remain predictions and hypotheses that need validation and verification. The method presented here is not a substitute for traditional toxicology or epidemiology. These studies are required to provide quantitative and population-generalizable estimates of disease risk and dose-response relationships. However, as the space of environmental chemicals potentially causing biological effects is large, we suggest that this methodology would give investigators at least some clue where to start the search for environmental causal factors to study in these other modes. We believe that, like the Connectivity Map, the Envirome Map is a feasible and practical way to represent toxicological response for use in prediction. Predicting a linkage between chemicals, genes, and clinically relevant disease phenotypes using existing resources falls in line with the National Academies’ vision of high-throughput efforts to decipher genome-wide toxicity response to disease [13].

CHAPTER 3. METHODS TO EXECUTE ENVIRONMENT-WIDE ASSOCIATIONS ON DISEASE AND DISEASE-RELATED PHENOTYPES ON POPULATIONS.

INTRODUCTION

Complex diseases and adverse phenotypes arise due to the contribution of multiple interacting genetic and environmental factors [2], but despite this, many recent epidemiological or population-based studies have emphasized the genetic components. For example, the Genome-wide Association Study (GWAS) is a low-cost, commoditized, and popular framework used by researchers to evaluate genetic factors that correlate with disease status on a genome-wide scale [9, 163-165] (Figure 5B). As a function of its accessibility and the simple nature of the measurements assayed in GWAS, standards for cross-study comparison and reporting of genetic association in epidemiology have been established, at the very least calling for comprehensive, systematic, and agnostic reporting of associations and their validation results. As a result, over 370 GWAS have been published, often over 20 for specific diseases such as T2D [9].

While GWAS has strengthened the epidemiological process and methodology of screening and validating genetic variants, most of the findings attained through the many studies have been unable to explain a large portion of risk variability between individuals and are of modest effect size [32, 33]. Furthermore, the variants found have not been able to shed light on the biology of disease. One hypothesis for this is that complex disease arises as a result of the sum of effects of variants that are less prevalent than those assayed in GWAS [166]. Another is that these studies have not considered the joint contributions of both genetics and the environment. However, before we may address the latter hypothesis, we must understand what environmental factors are associated with disease.

Despite the limited relevance of the genetic variants found and, more importantly, the fact that diseases arise out of the contribution of both genetics and the environment, there exists no analogous or comparable platform to assay and analyze enviromic associations to disease on an epidemiological scale. Specifically, given multiple environmental factors (the “envirome”, Chapter 1) measured on a population, we ask a question analogous to that asked in genome-wide association studies: what specific environmental factors are associated with a disease or phenotype of interest out of all possible individual environmental factors measured on an epidemiological scale? Note that these associations are of a different scale than those covered in Chapter 2 (“Mapping Multiple Toxicological Responses To Complex Disease”).

To answer this question, we propose a framework analogous to GWAS, called “Environment-wide association study” (EWAS), to search for and analytically validate environmental factors associated with continuous phenotypes or discrete ones such as disease (Figure 5A). This type of question is analogous to the questions facilitated by GWAS and differs from a hypothesis-driven approach, in which candidate environmental factors are chosen a priori and tested individually for their association with a phenotype.

We begin our description of EWAS by introducing the genome-wide analog, GWAS. Second, we describe the EWAS framework and third, describe differences between genome-wide and envirome-wide epidemiological studies. Fourth, we describe the current EWAS methodology. Last, we discuss our results and posit ways to extend the EWAS methodology. In the following chapter, we describe specific, peer-reviewed, published applications of EWAS.

METHODS BACKGROUND

Genome-wide association to disease

With the sequencing of the genome and projects that characterized common genetic variation such as the HapMap, investigators are now able to interrogate how genome-wide genetic differences are associated with disease and disease-related phenotypes on an epidemiological scale [25, 167]. These revolutionary studies, known as “genome-wide association studies” (GWAS), have enabled investigators to ask what common genetic loci are associated with a phenotype in an agnostic, systematic, and comprehensive way with explicit control of multiplicity.

Specifically, during the HapMap project, common single nucleotide polymorphism (SNP) variants have been catalogued on the basis of their population frequency (≥ 10% population frequency) and their major and minor allele versions [168]. The location of each SNP along the genome is referred to as a “locus” and the presence of variation at a particular locus denotes a “polymorphism” or a “polymorphic” locus. “Common” polymorphisms are those that occur at approximately greater than 5-10% in the population. Thus, by definition, a “common” SNP must reside at a polymorphic locus. There are greater than 1 million common SNPs in the genome [25]. While SNPs are the most common type of polymorphism in the genome, accounting for 90% of genetic variation, many other types of genetic variation exist, such as copy number variants, insertions, and deletions.

GWAS relate traits to variation at each common polymorphic locus in the genome, or at a large subset of these loci, and are enabled by genomic technologies known as “SNP microarrays”, which can assay greater than 1 million loci simultaneously for an individual. These microarrays are now mere commodity items, like computers, making genome-wide measurements accessible on a large number of individuals [169]. Further, these technology platforms are known to have very low measurement error [10].

GWAS are constructed by recruiting thousands of individuals with (“cases”) and without (“controls”) a trait or disease. Genotype frequencies at each locus across the genome are then compared between cases and controls using common statistical tests such as the chi-squared test [8], assuming independence between loci. Continuous traits, such as levels of a biomarker, may also be related to genetic variation by modeling the continuous phenotype in a linear regression model [170]. Multiple comparisons are accounted for through conservative Bonferroni adjustment, and significant loci are validated in independent populations, often (but not always) of differing demographic character from that of the original screen.
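A per-locus case-control comparison with Bonferroni adjustment can be sketched as below. As a simplifying assumption, the example tests 2x2 allele counts (minor versus major allele in cases versus controls) rather than full genotype tables, and uses the closed-form survival function of the chi-squared distribution with one degree of freedom. The function names, SNP labels, and counts are hypothetical.

```python
from math import erfc, sqrt

def chi2_allelic_test(case_minor, case_major, ctrl_minor, ctrl_major):
    """Pearson chi-squared test on a 2x2 allele-count table (1 degree of
    freedom). Assumes every row and column total is greater than zero."""
    table = [[case_minor, case_major], [ctrl_minor, ctrl_major]]
    total = sum(map(sum, table))
    row_totals = [sum(row) for row in table]
    col_totals = [table[0][j] + table[1][j] for j in range(2)]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_totals[i] * col_totals[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, P(X > chi2) = erfc(sqrt(chi2 / 2)).
    return chi2, erfc(sqrt(chi2 / 2))

def bonferroni_hits(pvalues, alpha=0.05):
    """Keep loci whose p-value is below alpha divided by the number of tests."""
    threshold = alpha / len(pvalues)
    return {locus: p for locus, p in pvalues.items() if p < threshold}

# Toy allele counts (hypothetical): case minor, case major, control minor, control major.
allele_counts = {
    "snp_1": (300, 700, 200, 800),
    "snp_2": (250, 750, 245, 755),
}
pvalues = {locus: chi2_allelic_test(*counts)[1]
           for locus, counts in allele_counts.items()}
hits = bonferroni_hits(pvalues)
print(pvalues, hits)
```

With one million loci tested at a family-wise α of 0.05, the same adjustment gives a per-locus threshold of 5x10^-8, which illustrates why the correction is described as conservative.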

Preceding GWAS were “candidate gene studies”, hypothesis-driven studies that correlate a handful of genetic variants to a trait of interest using a “smaller” sample size. As a consequence of lack of power and prohibitive genotyping cost, the agnostic, comprehensive, and systematic analytical and validation procedure of GWAS eluded traditional genetic association studies [32, 33, 171-173]. To facilitate discussion regarding “Environment-wide association”, we describe these “agnostic”, “systematic”, and “comprehensive” characteristics of GWAS.

GWAS is agnostic and data-driven, not hypothesis-driven. Traditionally, genetic epidemiology association studies were hypothesis-driven, testing a handful of genetic variants at a time against a phenotype. The process of GWAS also calls for systematic associations, both within an individual study and between multiple studies. First, the process of simultaneous association requires accounting for multiple testing, controlling for the family-wise error rate and false positives, a notable problem in the fragmented literature. Often the threshold for significance is fixed a priori (Bonferroni correction). Second, GWAS significant results are validated in additional populations at the same stringent level. Last, and related to the agnostic characteristic, GWAS approaches comprehensiveness: each common variant present on the measurement chip is associated with the phenotype, and its strength of association is reported in the context of all other common genotypes assayed (as seen in the “Manhattan plot”, Figure 5A).

Environment-wide association to disease

In the following, we propose a study design analogous to GWAS, called the “environment-wide association study” (EWAS), to search for and analytically validate environmental factors associated with complex diseases and phenotypes.

EWAS assumes a “data structure” similar to that of GWAS. Recall that in GWAS, multiple genetic factors are assayed along with phenotypic information on each individual (Figure 14). In other words, the genetic factors are the independent variables, and the phenotype is the dependent variable. In EWAS, the genome domain is substituted with the envirome domain (Figure 5A). Specifically, the quantity or presence of environmental factors is measured directly on each individual, such as the amount of a chemical in bodily tissue, or through a proxy measure, such as self-reported historical exposure (Chapter 2). This is in contrast to data that are self-reported and subject to bias, and to “ecological” data [11], that is, data summarized at a level higher than that of individuals, on samples grouped by some common characteristic, such as family, social network [174], or town or city region [175]. As discussed in Chapter 1 and further below, the environment is a dynamic entity, unlike the static genetic variants of GWAS. Thus, the dimension of time may also be added to the EWAS data structure, framing it in a longitudinal context. While we describe methods to accommodate a longitudinal data structure below, the specific applications described do not consider time (Figure 14).

Environmental factor classes (column groups over X1 ... Xp): Class A, Class B, Class Z

           phenotype  sex  age  ethnicity  SES  X1    X2  X3   ...  Xp
sample1    1          M    20   B          3    0.5   +   NA   ...  -.2
sample2    0          M    35   W          1    1     -   NA   ...  NA
sample3    0          F    55   M          1    0     -   0    ...  0
sample4    1          M    60   W          2    -.2   -   1    ...  -1
sample5    0          M    10   A          2    -.3   -   2    ...  1
...
samplen    1          F    40   W          2    0     -   3    ...  .3

Figure 14. Sample data structure for EWAS. “Phenotype” is the dependent variable. “Sex”, “age”, “ethnicity”, and “SES” (socioeconomic status) are examples of adjustment variables. X1 through Xp are environmental factors; sample1…samplen are the individuals that make up the sample. Values inside each cell denote an example of the data type for the variable. For example, “phenotype” here is a binary variable taking on 1 if the phenotype is present and 0 if absent; “sex” is a categorical variable for males and females. X variables representing environmental factors may be continuous (e.g., X1, Xp), positive/negative (e.g., X2), or ordinal (e.g., X3). Data might be missing (e.g., NA cells). The vertical axis indexes the individuals in the sample. Each environmental factor belongs to a disjoint “class”, or grouping, that represents a common characteristic of those factors, represented in the figure as “Class A”, “Class B”, and “Class Z”.
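As a minimal sketch of how the layout in Figure 14 might be held in memory, the Python below stores each individual as a dictionary and uses None for the NA cells. The assignment of factors to classes and all function and variable names are illustrative assumptions; they are not taken from the figure or from the applications described later.

```python
# Rows mirror the example cells of Figure 14; None stands for "NA".
# The factor-to-class mapping is assumed for illustration only.
FACTOR_CLASS = {"X1": "Class A", "X2": "Class A", "X3": "Class B", "Xp": "Class Z"}
ADJUSTMENT_VARS = ("sex", "age", "ethnicity", "SES")

COLUMNS = ("id", "phenotype", "sex", "age", "ethnicity", "SES", "X1", "X2", "X3", "Xp")
CELLS = [
    ("sample1", 1, "M", 20, "B", 3, 0.5, "+", None, -0.2),
    ("sample2", 0, "M", 35, "W", 1, 1.0, "-", None, None),
    ("sample3", 0, "F", 55, "M", 1, 0.0, "-", 0, 0.0),
    ("sample4", 1, "M", 60, "W", 2, -0.2, "-", 1, -1.0),
    ("sample5", 0, "M", 10, "A", 2, -0.3, "-", 2, 1.0),
    ("samplen", 1, "F", 40, "W", 2, 0.0, "-", 3, 0.3),
]
samples = [dict(zip(COLUMNS, row)) for row in CELLS]


def complete_cases(rows, factor):
    """Rows usable for testing one factor: the phenotype, the adjustment
    variables, and the factor itself must all be observed."""
    needed = ("phenotype", factor) + ADJUSTMENT_VARS
    return [r for r in rows if all(r[c] is not None for c in needed)]


def factors_in_class(class_name):
    """Names of the environmental factors assigned to one class."""
    return [f for f, c in FACTOR_CLASS.items() if c == class_name]
```

Keeping missing values explicit, rather than dropping incomplete individuals up front, lets each factor be tested on all individuals for whom that factor was measured.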

GWAS variables are “binned” by their chromosomal location, facilitating the description of their correlation structure, known as linkage disequilibrium (LD), when visualizing associations. Specifically, LD is the correlation between two loci in the genome. Further, LD is a function of the relative location of the two loci; that is, in general, the closer together two loci are on a chromosome, the higher their LD. Suppose we are considering one locus: in this scenario, we inherit alleles from our parents, one from the mother and one from the father. The genotype at one locus is a random event and depends on the frequency of alleles present at that locus in the mother and father. Now suppose two loci (two sets of genotypes) are in “LD”. This means that their patterns of inheritance are correlated; that is, the occurrences of a particular allele “A” at locus A and allele “B” at locus B are non-random, or dependent with respect to one another. In other words, the presence of one allele can predict the presence of the other. LD among different populations has been characterized by the HapMap project, and characterization is ongoing with the 1000 Genomes project [25]. In GWAS, LD structure “buys” us several things. First, since we assay only a prevalent subset of polymorphic loci, LD allows us to narrow down which variants might be causal; for example, given an association signal for a variant, the causal variant might be one in strong LD with it. LD also gives us an internal gauge of validity; for example, given a strong association signal for a variant at locus X, one would expect measured common variants that are in LD with X to also harbor some signal.
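To make the notion of non-random co-occurrence concrete, the following sketch computes two standard summaries of LD, the coefficient D and the squared correlation r², from a list of two-locus haplotypes. The haplotype encoding (pairs of allele labels) and the function name are hypothetical; the text does not specify how LD is computed.

```python
def ld_from_haplotypes(haplotypes, allele_a="A", allele_b="B"):
    """Return (D, r2) for allele_a at the first locus and allele_b at the second.

    `haplotypes` is a list of (first_locus_allele, second_locus_allele) pairs.
    D is the excess of the joint haplotype frequency over what independence
    would predict; r2 rescales D squared by the allele-frequency variances.
    """
    n = len(haplotypes)
    p_a = sum(h[0] == allele_a for h in haplotypes) / n
    p_b = sum(h[1] == allele_b for h in haplotypes) / n
    p_ab = sum(h == (allele_a, allele_b) for h in haplotypes) / n
    d = p_ab - p_a * p_b
    denom = p_a * (1 - p_a) * p_b * (1 - p_b)
    if denom == 0:
        raise ValueError("both loci must be polymorphic in the sample")
    return d, d * d / denom


# Alleles always travel together: complete LD.
linked = [("A", "B")] * 5 + [("a", "b")] * 5
# All four haplotypes equally frequent: no LD.
unlinked = [("A", "B"), ("A", "b"), ("a", "B"), ("a", "b")] * 2
```

In the first example knowing the allele at one locus determines the allele at the other (r² = 1); in the second it carries no information (r² = 0).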

At present, LD in EWAS is qualitative, not quantitative as in GWAS. In our applications (Chapter 4), we binned factors according to categories that described the compound “class”, had shared environmental health “relevance”, or described some other arbitrary characteristic shared by a group of factors. Current categories and examples within each are seen in Chapter 1. We anticipate that, as investigators characterize the envirome, these categories will encompass assays for stress, microbial flora, drugs, noise, and ecological measures. A research effort will be to fully characterize the LD of the “envirome”, including its correlation/covariance structure and population-wide prevalence, as has been done with the HapMap.

EWAS achieves the agnostic, systematic, and comprehensive qualities that characterize GWAS. First, instead of testing a few environmental associations at a time, EWAS evaluates multiple environmental factors agnostically. EWAS is comprehensive in that each factor measured is associated with the phenotype. Next, associations are systematically adjusted for multiplicity of comparisons. Further, EWAS calls for validation of significant associations in an independent population.

The EWAS framework calls for systematic and comprehensive sensitivity analyses of highly significant or validated factors. Specifically, all possible measured confounders are included in final models and their effect on the estimate of the environmental factor is assessed. Last, given the dense web of correlation for non-genetic measures, such as between environmental factors and clinical measures, the correlation structure between validated environmental factors and risk factors is systematically computed and visualized to understand the degree of their interdependence. By visualizing relationships in this way, we can infer groups of non-independent exposures associated with the phenotype, similar to “relevance network” or clustering analyses [176, 177].

Genetic versus non-genetic associations in population scaled studies

While EWAS has been inspired by GWAS, there are both critical differences and shared drawbacks between genetic and non-genetic epidemiology. In the following, we discuss these differences and similarities between 1) current-day non-genetic association studies and GWAS, and 2) the new paradigm of non-genetic association studies, or EWAS, and GWAS. Work done by Ioannidis et al. guides this section [10].

Current association studies seeking associations between environmental or non-genetic factors and phenotypes test a few factors at a time. Results may be further biased by selective reporting of subsets of analyses, phenotypes, and adjustments, leading to a fragmented body of literature [10, 178-180]. Second, and related to selective reporting of subsets of analyses, multiplicity of tests is not considered. Current environmental epidemiology studies are not agnostic, systematic, and comprehensive; however, the EWAS analytic method amends these differences as described in the previous section.

However, there remain some critical differences and drawbacks between the new paradigm of “enviromic” and genome-wide association. First, high-throughput, low-error, commoditized assay technologies have facilitated systematic, agnostic, and comprehensive interrogation of genome-wide variants. An analogous high-throughput and low-error assay technology platform does not exist for environmental factors.

Of course, a high-throughput assaying technology can be realized only after the domain of what to measure (the common loci of the genome) has been characterized. The HapMap project has enabled us to characterize the variability across the genome. Further, as a result of this characterization, we also have an idea of how genetic variants are “correlated”, that is, the pattern of linkage disequilibrium.

We are far from describing the “LD” of the envirome, let alone what environmental factors make up the envirome (Chapter 1). However, in our own applications (Chapter 4) and from others we know that the correlation matrix of environmental variables is dense [181]; many variables are correlated with each other strongly. Therefore, it is difficult to pinpoint both what factors are independently associated with the phenotype and the directionality of association.

Issues related to observational studies influence all association studies, be they hypothesis-driven candidate factor studies, GWAS, or EWAS. In contrast to “gold-standard” randomized trial data, both genetic and non-genetic studies rely on observational study data, such as longitudinal cohort, case-control, or cross-sectional data. Both types of epidemiological studies are subject to confounding biases that hinder causal inference and are avoided, to some degree, in randomized studies [182]; however, the gold-standard scenario of a clinical trial is not suited for agnostic study of the envirome, as it is impossible to randomize such a matrix of factors.

“Confounding” is used to describe a scenario in which a variable is correlated with both the factor of interest (the independent variable) and phenotype (dependent variable) [183]; in our analyses, the factor acts as a “proxy” to the confounding variable, resulting in a false association between the dependent and independent variable. A partial solution to this type of bias is including the confounder as a covariate in the statistical model, or “controlling” for the confounder. This of course is only possible when the confounder is known and measured.

In modern-day genome-wide studies, a notable example of confounding was the initial association between a variant belonging to the FTO gene and T2D [165]. In subsequent analyses adjusting for body mass index (BMI), a clinical risk factor associated with both T2D and the FTO variant, the association was nullified. Subsequently, the association of FTO with BMI and obesity was shown and validated in GWAS [170]. Confounding is a major issue in non-genetic studies, especially given the dense correlation structure of non-genetic and environmental variables, and many examples exist of associations biased by confounders. Famous examples include associations derived from observational studies that were later contradicted by randomized controlled trials (RCTs): 1) β-carotene, thought to mute the risk of smoking-induced cancer [184], only to be refuted by an RCT later [185]; 2) similarly, vitamin E and decreased risk of coronary heart disease (CHD) [186]; and even, 3) for vitamin C and CHD, relative risks from observational studies and RCTs had switched direction [69]!

Another source of “bias” includes “reverse causality”, or reverse association. Reverse causality leads to the failure to infer the proper “forward” direction between the independent variable and the dependent variable, the phenotype. Specifically, it occurs when the independent variable arises directly or indirectly as a result of the dependent variable. An example of this includes a sample-wide behavioral shift due to the dependent variable, such as increased intake of a vitamin due to an adverse phenotype. If we were to associate the environmental factor, the vitamin, with the phenotype as the dependent variable, the interpretation of the model as is suggests that a change in vitamin exposure leads to a change in phenotype, when in fact the opposite is true. These biases are especially present in case-control or cross-sectional studies, in which individuals are measured at one point in time. A way to take into account the dynamic nature of non-genetic variables and biases such as reverse causality is to conduct a longitudinal study, in which we may jointly observe changes in phenotype and exposure pattern as a function of time [34]. Lastly, reverse causality is a non-issue in genetic variant association studies due to the static state of nucleotide variants.

The nature of the environmental factors themselves also biases results. First, the assessment of the quantity of environmental factors in blood and serum is subject to measurement error [10], and self-report variables are subject to recall bias. Further, physiological characteristics of the factors themselves influence estimates, including the variability of the kinetics of chemical factors, such as how long they are retained in accessible body tissue. For example, chemical compounds that are easily measured include those that are lipophilic and persistent in fatty tissue. As adiposity is related to both the measurement of the factor and, often, the phenotype of interest (e.g., metabolic syndrome), a positive correlation might indicate confounding. On the other hand, many types of factors are excreted quickly, also affecting their measurement and association to the phenotype of interest; however, “steady-state” or constant exposure might allay a kinetic effect of environmental chemicals [187]. Nevertheless, in genetic studies, these issues are altogether avoided: error rates of array-based assays and DNA sequencing are minuscule compared to those of environmental factor measurements.

EWAS METHOD

The EWAS methodology and analysis framework is analogous to that utilized in GWAS. First, we conduct an initial scan for environmental factors associated with a phenotype of interest through generalized linear modeling, such as logistic or linear regression. Since environmental association occurs in the observational (versus randomized) scenario, these models include variables that adjust for known confounders, such as clinical risk factors. Second, we account for multiple hypotheses by estimating the false discovery rate (FDR). Third, factors that we deem significantly associated with the phenotype beyond the region of false discovery are “validated” in independent cohorts. Factors that are validated are considered true discoveries.

The EWAS framework also calls for systematic sensitivity analyses, whereby validated factors are modeled under different assumptions or with additional covariates. Further, the pair-wise correlation between each validated factor is computed and examined to determine their dependence, which can be interpreted as potential evidence for route of exposure or confounding. Each step is described further.

Stage 1: Linear Modeling

Each environmental factor is associated with a phenotype of interest using generalized linear models; for example, each factor is associated with disease status using logistic regression. Normally distributed continuous phenotypes are related to environmental factors with linear regression. Common risk, demographic, and clinical factors, such as age, sex, ethnicity, and socioeconomic status, are added as adjusting variables, as phenotypic states and environmental factors are confounded by these variables. Thus, for an environmental factor Xi in our list of measured factors X1 … Xp, we model the disease state (Y) as a linear function of the environmental factor and adjustment variables (represented by Z):

$$Y = \alpha + \beta_i X_i + \zeta Z$$

Xi corresponds to the environmental factor and βi corresponds to the effect size of that factor, adjusted by other variables.
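For concreteness, the two regression settings named above can be written with the same linear predictor. This is the standard textbook form of logistic and linear regression, shown as a sketch in the notation of the preceding equation; the error term ε is an added symbol.

```latex
\begin{align}
\text{logistic (binary } Y\text{):} \quad & \log\frac{P(Y=1)}{1-P(Y=1)} = \alpha + \beta_i X_i + \zeta Z \\
\text{linear (continuous } Y\text{):} \quad & Y = \alpha + \beta_i X_i + \zeta Z + \varepsilon
\end{align}
```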

The strength of association is assessed by the 2-sided p-value for βi, which tests the “null hypothesis” that βi is equal to zero. When modeling the logit of the phenotype (logistic regression), the exponentiation of βi serves as the odds ratio, or the multiplicative change in the odds of diseased versus un-diseased status for a unit change of the factor. In the linear regression setting, βi can be interpreted as the change in phenotype per unit change of the factor. In summary, the screening procedure of stage 1 can be described by this pseudo-code:

1. Pvalues <- NewList()
2. Effect_sizes <- NewList()
3. For xi in [X1…Xp]:
4.     Modi <- GeneralLinearModel(phenotype, xi, Xses, Xeth, Xsex, Xage)
5.     ListAppend(Pvalues, getPvalue(Modi, xi))
6.     ListAppend(Effect_sizes, getEffectSize(Modi, xi))

Algorithm 1. Screening for Environmental Factors (Stage 1) of EWAS.

In Algorithm 1, lines 1 and 2 initialize list data structures to store p-values and effect sizes (coefficients) for each environmental factor. In lines 3 and 4, we fit a linear model (‘GeneralLinearModel’) that models the phenotype as a function of environmental factor Xi and the adjusting factors. In lines 5 and 6, we simply take the p-value and coefficient from each model and store them in our lists. P-values are computed through common tests of significance, such as Wald tests.
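A runnable sketch of Algorithm 1 for the continuous-phenotype case follows. It swaps in ordinary least squares, fitted from the normal equations, for ‘GeneralLinearModel’ and uses a normal approximation to the Wald test; a logistic model would be needed for binary phenotypes. The function names and the simulated data are assumptions made for illustration, and a statistical library would normally replace the hand-written solver.

```python
import math
import random


def invert(matrix):
    """Invert a small square matrix by Gauss-Jordan elimination with pivoting."""
    k = len(matrix)
    aug = [row[:] + [float(i == j) for j in range(k)] for i, row in enumerate(matrix)]
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(aug[r][c]))
        aug[c], aug[pivot] = aug[pivot], aug[c]
        scale = aug[c][c]
        aug[c] = [v / scale for v in aug[c]]
        for r in range(k):
            if r != c:
                f = aug[r][c]
                aug[r] = [v - f * w for v, w in zip(aug[r], aug[c])]
    return [row[k:] for row in aug]


def ols(y, columns):
    """Fit y on an intercept plus `columns`; return (coefficients, standard errors)."""
    n = len(y)
    X = [[1.0] + [col[r] for col in columns] for r in range(n)]
    k = len(X[0])
    xtx = [[sum(row[a] * row[b] for row in X) for b in range(k)] for a in range(k)]
    xty = [sum(row[a] * yr for row, yr in zip(X, y)) for a in range(k)]
    inv = invert(xtx)
    beta = [sum(inv[a][b] * xty[b] for b in range(k)) for a in range(k)]
    resid = [yr - sum(b * v for b, v in zip(beta, row)) for row, yr in zip(X, y)]
    sigma2 = sum(e * e for e in resid) / (n - k)
    se = [math.sqrt(sigma2 * inv[a][a]) for a in range(k)]
    return beta, se


def wald_p(beta, se):
    """Two-sided p-value for H0: beta = 0, using a normal approximation."""
    return math.erfc(abs(beta / se) / math.sqrt(2.0))


def ewas_screen(phenotype, factors, adjusters):
    """One model per factor (Algorithm 1): phenotype ~ factor + adjusters."""
    pvalues, effect_sizes = {}, {}
    for name, x in factors.items():
        beta, se = ols(phenotype, [x] + adjusters)
        effect_sizes[name] = beta[1]  # index 1 is the factor; index 0 is the intercept
        pvalues[name] = wald_p(beta[1], se[1])
    return pvalues, effect_sizes


# Simulated data: x_true affects the phenotype, x_null does not; age is an adjuster.
random.seed(0)
n = 200
age = [random.gauss(0, 1) for _ in range(n)]
x_true = [random.gauss(0, 1) for _ in range(n)]
x_null = [random.gauss(0, 1) for _ in range(n)]
phenotype = [0.8 * x + 0.5 * a + random.gauss(0, 1) for x, a in zip(x_true, age)]

pvalues, effect_sizes = ewas_screen(
    phenotype, {"x_true": x_true, "x_null": x_null}, [age]
)
```

Each factor gets its own model with the same adjusters, so the loop body is the only part that changes when a different model family is substituted.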

Continuous factors are z-transformed (centered about the mean and divided by their standard deviation) in order to compare the effect sizes. Many factors measured in tissue have a right skew and thus are log-transformed prior to z-transformation. Binary factors (such as presence or absence of a factor) are standardized such that the effect size reflects a unit change between exposed and un-exposed status; that is, the referent is consistently the “negative” result of a binary test. Ordinal factors are left untransformed.
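The transformations described above might be sketched as follows. The use of the sample standard deviation, the natural logarithm, and the "+"/"-" coding of binary results are assumptions; the text does not fix these details, and the function names are hypothetical.

```python
import math
import statistics


def z_transform(values):
    """Center on the mean and divide by the (sample) standard deviation."""
    mean = statistics.fmean(values)
    sd = statistics.stdev(values)
    return [(v - mean) / sd for v in values]


def log_z_transform(values):
    """For right-skewed, strictly positive measurements: log first, then z-transform."""
    return z_transform([math.log(v) for v in values])


def encode_binary(values, negative="-"):
    """Code the negative result as the referent (0) and any other result as 1."""
    return [0 if v == negative else 1 for v in values]
```

After these steps a coefficient for a continuous factor is expressed per standard deviation of the (possibly logged) factor, which is what makes effect sizes comparable across factors.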

Stage 2: Controlling for Multiple Hypotheses by Estimating the False Discovery Rate

Given a set of “discoveries”, or a list of potentially significant factors, how can we determine which are false discoveries? In the GWAS setting, Bonferroni correction is utilized to adjust for multiple comparisons. The Bonferroni adjustment is straightforward: it simply divides the significance threshold α by the total number of tests conducted. This adjustment controls the “family-wise error rate”: the probability of having one or more false positives in the set of results is no greater than in a setting in which only one hypothesis was tested at level α. However, the threshold is conservative, and therefore we lose power for detection.
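As a worked illustration, with m tests the Bonferroni threshold and the resulting bound on the family-wise error rate (FWER) can be written as follows. The symbols m and p_j (the p-value of test j) are added notation, the bound is shown for the case in which all m null hypotheses are true, and m = 10^6 is chosen only to match the order of magnitude of loci on a SNP microarray mentioned earlier.

```latex
\begin{align}
\alpha_{\text{Bonf}} &= \frac{\alpha}{m}, \\
\text{FWER} &= P\Big(\bigcup_{j=1}^{m} \{\, p_j \le \alpha/m \,\}\Big)
  \;\le\; \sum_{j=1}^{m} P\big(p_j \le \alpha/m\big) \;=\; m \cdot \frac{\alpha}{m} \;=\; \alpha, \\
\text{e.g.}\quad \alpha &= 0.05,\; m = 10^{6} \;\Rightarrow\; \alpha_{\text{Bonf}} = 5 \times 10^{-8}.
\end{align}
```

The inequality is the union bound, which holds whatever the correlation between tests; this generality is why the threshold is conservative when factors are strongly correlated.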

To account for multiple comparisons, we compute an empirical estimate of the False Discovery Rate (FDR) derived through permutations of the phenotype multiple times, effectively creating a “null distribution” of test statistics. In contrast to the Bonferroni correction, the FDR provides a quantitative estimate of the number of false positives in a set of “discoveries”. The FDR is less conservative and therefore more powerful than the Bonferroni correction [38]. Further, since our estimate of the FDR utilizes the data itself, it inherently considers the covariance structure of the data, an important quality given the dense correlation of non-genetic factors [38].

The FDR is the estimated proportion of false discoveries relative to the number of discoveries made at a given significance level α, and is used to control for multiple hypothesis testing. To estimate the number of false discoveries, we create a “null distribution” of regression test statistics by shuffling the phenotype a large number of times (100-1000) and refitting the regression models. The FDR is the ratio of the proportion of results called significant at a given level α in the null distribution to the proportion of results called significant in our real tests. We use a significance level that corresponds to an FDR of 5-10% to select associations.

The pseudo-code to compute the FDR follows:

1. Do: ‘Algorithm 1’.
2. nullPvalues <- NewList()
3. For i in [1…numberPermutations]:
4.     randomPheno <- permutePhenotypeWithoutReplacement(phenotype)
5.     For xi in [X1…Xp]:
6.         Modi <- GeneralLinearModel(randomPheno, xi, Xses, Xeth, Xsex, Xage)
7.         ListAppend(nullPvalues, getPvalue(Modi, xi))
8. fdrRaw <- NewList()
9. For pvalue in sortAscending(Pvalues):
10.     numerator <- sum(nullPvalues <= pvalue) / numberPermutations
11.     denominator <- sum(Pvalues <= pvalue)
12.     ListAppend(fdrRaw, numerator / denominator)
13. fdrs <- NewList()
14. For i in [1…p]:
15.     fdr <- min(fdrRaw[i…p])
16.     ListAppend(fdrs, fdr)

Algorithm 2. Computing the FDR (q-value) for each p-value during Stage 1 of EWAS.

To begin Algorithm 2, we need to have completed stage 1 of EWAS with Algorithm 1. Then, for a number of permutations, we refit the regression model on the permuted phenotype for each environmental factor and collect all of these ‘null’ p-values (lines 3-7). For each p-value computed in Stage 1, we compute the raw FDR, or the ratio of the number of results per permutation that pass that p-value threshold in the permuted data to the number of results that pass that p-value threshold in stage 1 (lines 10-12). As the FDR should be a monotonically increasing function of the p-value, we ensure that the FDR for a p-value is the minimum of the FDRs for all p-values equal to or greater than that p-value (line 15). The resulting array of FDR values corresponds to the FDR for each p-value computed in Stage 1.
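The ratio and monotonicity steps of Algorithm 2 can be sketched in Python as below. The sketch takes the null p-values as given (in practice they come from refitting every model on each shuffled phenotype), counts p-values at or below each threshold, and caps the estimate at 1; the cap, the function name, and the toy numbers are assumptions added for illustration.

```python
def empirical_fdr(pvalues, null_pvalues, n_permutations):
    """Estimate an FDR for each observed p-value.

    pvalues:       observed p-values, one per environmental factor.
    null_pvalues:  flat list of p-values pooled over all permuted refits
                   (n_permutations * number_of_factors entries).
    Returns a list of FDR estimates aligned with `pvalues`.
    """
    order = sorted(range(len(pvalues)), key=lambda i: pvalues[i])
    raw = []
    for i in order:
        threshold = pvalues[i]
        # Expected number of false discoveries per permutation at this threshold.
        expected_false = sum(q <= threshold for q in null_pvalues) / n_permutations
        # Number of discoveries actually called at this threshold.
        called = sum(q <= threshold for q in pvalues)
        raw.append(expected_false / called)
    # Enforce monotonicity: running minimum from the largest p-value downward.
    for k in range(len(raw) - 2, -1, -1):
        raw[k] = min(raw[k], raw[k + 1])
    fdr = [0.0] * len(pvalues)
    for rank, i in enumerate(order):
        fdr[i] = min(1.0, raw[rank])
    return fdr


# Toy input: four factors and two permutations (eight pooled null p-values).
observed = [0.5, 0.001, 0.9, 0.02]
null = [0.3, 0.6, 0.01, 0.8, 0.05, 0.7, 0.95, 0.4]
fdr = empirical_fdr(observed, null, n_permutations=2)
```

For the second factor (p = 0.02), one null p-value in two permutations falls at or below the threshold while two real results do, giving an estimated FDR of 0.5 / 2 = 0.25.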

Of course, the original method for estimating the FDR can be used [39], eliminating the need for Algorithm 2. However, as discussed earlier, estimating the FDR through permutations of the dependent variable is preferred in the scenario in which the variables are correlated. In addition, much has been documented about which variables to permute or bootstrap. For example, it has been suggested that model residuals, the differences between the predicted and true values, should be permuted (or bootstrapped) as opposed to the original outcome variables (replacing line 4 in Algorithm 2 appropriately). In our experience (Chapter 4), we obtained similar estimates of the FDR under different documented methods of permuting. The reader is advised to refer to Manly, Efron, and Westfall and Young for more in this area [188-190].

Stage 3: Validation

Findings deemed significant at some nominal FDR level are validated in one or more additional independent cohorts. As a rule, the validation result must meet the same or a more stringent FDR level as in the initial cohort. For example, if a factor is deemed significant at an FDR of 10% in one cohort, it must also be significant at an FDR of 10% in one of the validation cohorts. Furthermore, and importantly, the sign of the effect size in the validation cohort must be the same as that in the initial screen.
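The validation rule can be stated compactly in code. This is a sketch under assumptions: each result is represented as a dictionary with hypothetical keys 'fdr' and 'effect', and the threshold is treated as a strict inequality.

```python
def is_validated(discovery, validations, fdr_threshold=0.10):
    """Apply the validation rule to one factor.

    discovery:    {'fdr': ..., 'effect': ...} from the initial cohort.
    validations:  list of such dictionaries, one per validation cohort.
    A factor is validated if it passes the FDR threshold in the initial
    cohort and in at least one validation cohort with the same effect sign.
    """
    if discovery["fdr"] >= fdr_threshold:
        return False
    for result in validations:
        same_sign = (result["effect"] > 0) == (discovery["effect"] > 0)
        if result["fdr"] < fdr_threshold and same_sign:
            return True
    return False
```

Checking the sign alongside the FDR prevents a factor from being counted as replicated when two cohorts show significant effects in opposite directions.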

We also compute the empirical FDR of the validation step, the overall FDR of validating a factor. We first estimate the number of false positives by counting the number of factors found significant at level α in multiple cohorts in the permuted analyses. For example, to assess the FDR of validating a factor in 2 cohorts, we collected the factors that fell below the significance threshold α in the permuted data corresponding to two different cohorts and counted the number of factors found significant in both. We repeated this operation on all possible pairs of cohorts, adding up the numbers found to be significant in each pair. We then estimate the FDR by computing the ratio between the total number of false positives and the number of validated factors in the real data (factors found to be significant in more than 1 cohort). We repeated the analogous operation for factors significant in however many cohorts we use to validate our results. The pseudo-code for this procedure follows:

1. numberOfCohorts <- numberOf(cohorts)
2. fdrThreshold <- 10%
3. significantFactors <- NewHash(key=factor, init=0)
4. significantNullFactors <- NewHash(key=factor, init=0)

5. For cohort in cohorts:
6.     Do ‘Algorithm 1’.
7.     Do ‘Algorithm 2’.
8.     pvalueThresh <- max(Pvalues[fdrs < fdrThreshold])
9.     significantFactorsInCohort <- whichFactors(Pvalues <= pvalueThresh)
10.     significantFactorsInNullCohort <- whichFactors(nullPvalues <= pvalueThresh)
11.     For factor in significantFactorsInCohort:
12.         significantFactors[factor] ++
13.     For factor in significantFactorsInNullCohort:
14.         significantNullFactors[factor] ++

15. validatedFactors <- NewHash(key=numCohort, init=0)
16. For numCohort in [2..numberOfCohorts]:
17.     For factor in significantFactors:
18.         If significantFactors[factor] >= numCohort:  # need to check the effect size direction
19.             validatedFactors[numCohort] ++

20. nullValidatedFactors <- NewHash(key=numCohort, init=0)
21. For numCohort in [2..numberOfCohorts]:
22.     For factor in significantNullFactors:
23.         If significantNullFactors[factor] >= numCohort:
24.             nullValidatedFactors[numCohort] ++

25. validatedFDR <- NewHash(key=numCohort)
26. For numCohort in [2..numberOfCohorts]:
27.     falsePosValidRate <- nullValidatedFactors[numCohort] / numberPermutations
28.     fdr <- falsePosValidRate / validatedFactors[numCohort]
29.     validatedFDR[numCohort] <- fdr

Algorithm 3. Computing the FDR for the multi-cohort validation.

In lines 1 and 2, we retrieve the number of independent cohorts we use to tentatively validate a significant result and initialize our significance threshold for a finding in 1 cohort (e.g., FDR < 10%). We then initialize two hash data structures, indexed by the environmental factor name string, which contain the number of cohorts in which the factor is significant; one is for the real results and the other for ‘null’ results, or results attained through permutation of the phenotype label (lines 3 and 4). Next, for each individual cohort, we do our EWAS stage 1 screen (line 6) and compute the within-cohort FDR (line 7). Then, we collect the significant factors that pass the nominal FDR threshold (lines 8-10). Then, we iterate through these significant factors and increment a counter for each factor (lines 11-12); for factors that are tentatively validated, the count will be greater than 1. We do the analogous operation for the permuted dataset (lines 13-14). Next, we count the number of validated factors by totaling up the factors that had a count greater than 1 (lines 15-19) and do the analogous operation for the permuted dataset (lines 20-24). Finally, we compute the FDR of validating a factor in lines 25-29: for each possible validation scenario numCohort (where numCohort is the number of cohorts, 2, 3, 4, etc., in which a factor is significant), we estimate the FDR as the rate at which a factor was found to be significant in numCohort permuted cohorts divided by the actual number of factors validated in the real dataset. Thus, this FDR corresponds to validating a factor with the significance rule of FDR for a single cohort.
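A compact sketch of the final ratio follows. It assumes that the sets of significant factors have already been collected for each cohort, both for the real data and for each permutation, and it tallies validated factors within each permutation before averaging; that per-permutation bookkeeping, the omission of the effect-direction check, and all names are assumptions rather than a transcription of Algorithm 3.

```python
from collections import Counter


def count_validated(hits_per_cohort, min_cohorts):
    """Number of factors that are significant in at least `min_cohorts` cohorts."""
    counts = Counter(f for hits in hits_per_cohort for f in hits)
    return sum(1 for c in counts.values() if c >= min_cohorts)


def validation_fdr(real_hits, null_hits, min_cohorts=2):
    """FDR of validating a factor in at least `min_cohorts` cohorts.

    real_hits:  list with one set of significant factor names per cohort.
    null_hits:  list with one entry per permutation; each entry is a list
                with one set of significant factor names per cohort.
    Returns None when no factor is validated in the real data.
    """
    n_real = count_validated(real_hits, min_cohorts)
    if n_real == 0:
        return None
    mean_null = sum(count_validated(p, min_cohorts) for p in null_hits) / len(null_hits)
    return mean_null / n_real


# Toy input: three cohorts and four permutations.
real = [{"a", "b", "c"}, {"a", "b"}, {"a"}]
null = [
    [{"a"}, {"a"}, set()],
    [{"b"}, set(), {"c"}],
    [set(), set(), set()],
    [{"c"}, {"d"}, {"c"}],
]
```

Here two factors are validated in at least two real cohorts, while on average half a factor is "validated" per permutation, giving an estimated validation FDR of 0.25.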

Final estimates for validated factors are computed by combining independent cohorts. Tests for heterogeneity between cohorts are also performed to ensure the final overall estimate is unbiased by any specific cohort.

Stage 4: Sensitivity Analyses

Confounding and reverse causality influence the strength of association, bias the effect size estimate, and in general affect causal inference of environmental factors to phenotypes. Thus, we propose a method to begin to measure these biases approximately. We cannot claim to find all such biases nor to eliminate confounding; nevertheless, we describe methods to assess bias given that the relevant variables were measured.

First, we systematically comb through all measured variables that were not considered in our list of environmental factors, but could influence the association, and sequentially add each to the linear model as an additional covariate. Then the p-value of association and effect size corresponding to the environmental factor calculated from the extended model are compared with those from the original model computed in Stage 1. The difference between the extended and original factor coefficients quantifies the approximate bias due to the new variable.
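In the notation of the Stage 1 model, this comparison can be written as below. The candidate variable W, its coefficient γ, the primes marking re-estimated coefficients, and the symbol Δ are added here for illustration.

```latex
\begin{align}
\text{original model:} \quad & Y = \alpha + \beta_i X_i + \zeta Z \\
\text{extended model:} \quad & Y = \alpha' + \beta_i' X_i + \zeta' Z + \gamma W \\
\text{approximate bias due to } W\text{:} \quad & \Delta_i = \beta_i' - \beta_i
\end{align}
```

A value of Δ close to zero suggests that the association of the factor with the phenotype is not sensitive to W, whereas a large change, or a loss of significance of the extended coefficient, flags W as a potential confounder.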

Types of variables that might bias our associations depend on the phenotype and environmental factors under study, but often include knowledge of clinical status (e.g., diagnosis of a disease), recent food, supplement, or drug intake, and physical activity. For example, knowledge regarding one’s disease state might induce behavioral change, resulting in increased exposure to foods high in vitamins and certain nutrients; association between these vitamin factors and disease might then be attributed to reverse causality. Or, use of a drug might induce phenotypic change, biasing estimated effects toward the null.

This method depends on the availability of a multitude of measured potential confounders. Large epidemiological datasets arising from the public domain or from large consortia often measure many of these other clinical and behavioral non-genetic variables, which can be utilized to test the “sensitivity” of the final validated effects of environmental factors associated with a phenotype. We give specific examples of our sensitivity analyses when covering applications in sections below.

Stage 5: Correlation Globes

The correlation/covariance structure between non-genetic measures is known to be “dense”, and this structure also influences our ability to infer the independent effect of factors on phenotype as discovered in EWAS. Furthermore, our initial screen methodology assumes independence between factors, and we therefore have little idea about their correlation.

Concretely, given a list of discovered factors, their joint association to the phenotype of interest might be due to their correlation, such as similar routes of exposure. We assess the degree of dependency between validated factors by computing their raw correlation coefficient (Pearson’s ρ) and visualizing this with a correlation “globe”. By visualizing relationships in this way, we can infer non-independent exposures associated with phenotype [176, 177].
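The pairwise computation underlying such a visualization might be sketched as follows. The function names, the handling of missing values by pairwise deletion, and the optional cutoff on the absolute correlation are assumptions; drawing the globe itself is outside the scope of the sketch.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation using only individuals observed for both factors."""
    pairs = [(a, b) for a, b in zip(x, y) if a is not None and b is not None]
    n = len(pairs)
    mean_x = sum(a for a, _ in pairs) / n
    mean_y = sum(b for _, b in pairs) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in pairs)
    sxx = sum((a - mean_x) ** 2 for a, _ in pairs)
    syy = sum((b - mean_y) ** 2 for _, b in pairs)
    return sxy / math.sqrt(sxx * syy)


def correlation_edges(factors, min_abs_r=0.0):
    """Return {(name_a, name_b): r} for each pair of validated factors
    whose absolute correlation is at least `min_abs_r`."""
    edges = {}
    for a, b in combinations(sorted(factors), 2):
        r = pearson(factors[a], factors[b])
        if abs(r) >= min_abs_r:
            edges[(a, b)] = r
    return edges
```

Each returned pair corresponds to one edge that could be drawn between two validated factors, with the sign and magnitude of r determining its appearance.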

DISCUSSION

As described above, EWAS may facilitate many different ways of screening for factors. We describe extensions that might be used off-the-shelf to accommodate longitudinal data, as well as statistical learning methods that consider the entire matrix of independent variables at once.

Longitudinal data

As discussed, environmental factors are dynamic. One way to capture the dynamic relationship between environmental factors and a phenotype of interest is to measure individuals repeatedly over time. An example is a longitudinal cohort study, in which a cohort is followed for a certain amount of time beginning prior to disease onset, such as in childhood or adolescence. This type of study design might lessen the bias of reverse causality, but not completely [34].

For a binary dependent variable, the Cox proportional hazards model is a common analytic model that can accommodate both time-dependent independent and dependent variables. With this model, we simply substitute line 4 of Algorithm 1 with the Cox model that inputs time-dependent variables. For both continuous and binary dependent variables, hierarchical modeling techniques such as generalized estimating equations may be utilized. The EWAS as described by Algorithm 1 depends on the computation of an individual p-value and effect size for each environmental factor, and statistical tests for these modeling techniques satisfy this requirement. Calculation of the empirical FDR also proceeds in the same way [191].

Feature Selection: Shrinkage Methods

The EWAS screening method considers each environmental factor in a separate linear model iteratively (algorithm 1). This makes it feasible to screen and interpret many variables without over-fitting the linear model (i.e., p << n, where p is the number of predictors and n is the number of individuals). However, this approach falsely assumes independence between environmental factors. Statistical learning methods, such as “shrinkage” methods, enable one to model the independent variables simultaneously, even in the high-dimensional (p ≥ n) setting.

Two popular shrinkage methods are the “Lasso” [192] and the “elastic net” [193]. These methods are extensions of multivariate regression, have some relation to tree “boosting” methods [194], and are applicable over the generalized linear model family, including Cox proportional hazards for longitudinal data [191]. Both the lasso and elastic net are able to fit a model with more predictors than observations by constraining the size of coefficients (“shrinking”). Because these methods consider the entire set of independent variables simultaneously (i.e., multiple regression), algorithm 1 is supplanted by the shrinkage procedure. Further, k-fold cross-validation is utilized to select features that have the lowest prediction variability on the k datasets held out of the model-building process [194].
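For reference, the two penalized objectives can be written as follows for a continuous phenotype. This is the commonly stated least-squares form, not necessarily the exact parameterization of the cited works; the symbols (phenotype y_i, factor vector x_i, intercept β_0, penalty strength λ, and mixing weight w between 0 and 1) are notation assumed here, and λ is not the significance level α used elsewhere in this chapter.

```latex
% Lasso: squared-error loss plus an L1 penalty on the coefficients
\hat{\beta}^{\mathrm{lasso}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^{2}
    + \lambda \sum_{j=1}^{p} \lvert \beta_j \rvert

% Elastic net: a weighted mix of the L1 and squared L2 penalties
\hat{\beta}^{\mathrm{enet}}
  = \arg\min_{\beta_0,\,\beta}\;
    \frac{1}{2n}\sum_{i=1}^{n}\bigl(y_i - \beta_0 - x_i^{\top}\beta\bigr)^{2}
    + \lambda \sum_{j=1}^{p}\Bigl( w\,\lvert \beta_j \rvert + \tfrac{1-w}{2}\,\beta_j^{2} \Bigr)
```

Setting w = 1 recovers the lasso. For a binary phenotype the squared-error term would be replaced by the negative log-likelihood of the logistic model, and λ would typically be chosen by the k-fold cross-validation described above.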

Feature selection operates by optimizing prediction accuracy of the dependent variable and not by ordering the test statistics of individual coefficients used in inference. Thus, we must reconfigure parts of Stage 1 (FDR estimation) and Stage 2 (Validation) to accommodate this. Reconfiguring Stage 1, we use one cohort as the “discovery” cohort, applying the shrinkage method to find factors associated with the phenotype. Within this cohort, k-fold cross-validation is applied in order to optimize prediction accuracy. Thereafter, the top factors found through this method are “validated” individually in additional validation cohorts using common tools for inference (e.g., GLM). Successful validation requires low nominal p and FDR values for the validation analyses.

Of course, “classical” methods for feature selection exist in the linear regression domain, such as “forward-stepwise” and “backward-stepwise” selection. These methods may be used to select environmental factors, but we do not discuss them further because the step-wise procedure leads to high variability in subset selection, ultimately reducing prediction accuracy [195]. The shrinkage methods discussed above avoid this problem.

In this chapter, we have presented a straightforward and generalizable way to associate environmental variables of large dimension to disease. Furthermore, we present a way of ranking what variables we may want to pursue for further study through computation of the FDR. Because of its proposed utility, the method has become a center point of discussion and debate [1, 196-200]. In the following chapter, we demonstrate this claimed utility, applying the method to Type 2 Diabetes and Serum Lipid Levels, risk factors for cardiovascular disease.


CHAPTER 4: ENVIRONMENT-WIDE ASSOCIATIONS TO DISEASE AND ADVERSE PHENOTYPES: APPLICATIONS TO TYPE 2 DIABETES (T2D) AND SERUM LIPID LEVELS

INTRODUCTION

In the following, we exemplify methods and techniques presented in the previous chapter with published or submitted “Environment-wide Association Studies” (EWAS) on diseases including type 2 diabetes (T2D) [35] as well as on phenotypes that are risk factors for disease, such as serum lipid levels [36]. As described in the previous chapter, EWAS is a framework to comprehensively and systematically test for environmental association to disease, analogous to the “Genome-wide Association Study” (GWAS), a now standard framework in genetic epidemiology to associate genetic variants in a genome-wide dimension to disease.

First, the EWASes presented concern complex diseases that are known to be multifactorial in etiology, in which many environmental and genetic factors are known to play a role [2]. Second, they are of great concern given their rise to epidemic status [201]. Third, through GWAS executed on samples of significant size, we have a robust set of common genetic variants associated to these diseases, for example T2D [202] and lipid levels [65]. Furthermore, and most importantly, this list of genomic loci is being updated and examined continuously [9, 203]; however, we lag behind in identifying the comprehensive set of environmental factors (Chapter 1).

The following studies are made possible by the National Health and Nutrition Examination Survey (NHANES), a representative biennial health survey of the non-institutionalized population of the US [37]. In NHANES, participants are queried regarding their health status, and an extensive battery of clinical and laboratory tests is performed on a subset of these individuals. Specific environmental attributes are assayed, such as chemical toxins, pollutants, allergens, bacterial/viral organisms, and nutrients. Of biomedical relevance, we identified novel environmental factors such as nutrients and industrial pollutants associated with these diseases that should be examined in follow-up validation studies.

ENVIRONMENT-WIDE ASSOCIATION STUDY ON TYPE 2 DIABETES

EWAS on T2D: Methods

We associate 266 unique environmental factors to T2D status from the NHANES. We downloaded all of the available NHANES data for the 1999-2000, 2001-2002, 2003-2004, and 2005-2006 cohorts and collated corresponding variables across them. For example, if a variable LBXVIE from 1999-2000 described “A-Tocopherol ug/dL” and a variable named LBXATC from 2001-2002 also described “a-tocopherol ug/dL”, we applied the same name to each, LBXATC.
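The collation step can be sketched as a renaming pass over each cohort's variables. This is a minimal illustration, not the code used in the study: the alias table contains only the LBXVIE/LBXATC pair named in the text, the function name is hypothetical, and the values are invented.

```python
# Hypothetical alias table: cohort-specific NHANES variable name -> common name.
# Only the LBXVIE/LBXATC pair comes from the text; a real table would be
# curated by hand from the NHANES codebooks.
ALIASES = {"LBXVIE": "LBXATC"}


def harmonize(cohort_tables, aliases=ALIASES):
    """Rename aliased variables so one factor shares one name in every cohort."""
    return {
        cohort: {aliases.get(name, name): values for name, values in table.items()}
        for cohort, table in cohort_tables.items()
    }


# Toy values, invented for illustration only.
raw = {
    "1999-2000": {"LBXVIE": [1100.0, 950.0]},
    "2001-2002": {"LBXATC": [1020.0, 1210.0]},
}
collated = harmonize(raw)
shared = set(collated["1999-2000"]) & set(collated["2001-2002"])
```

After renaming, the factors measured in more than one cohort can be found by intersecting the variable names, which is what later multi-cohort validation relies on.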

Figure 15 presents a schematic representation of our analysis methodology. We analyzed all environmental factors from the NHANES that were a direct measurement of an environmental attribute, such as the amount of pesticide or heavy metal present in urine or blood. We did not consider internal biological system laboratory measures such as red blood cell count, triglyceride level, cholesterol level, or other physiological measures. By using direct and quantitative measures of factors, we potentially avoid issues of self-report bias.

There was a total of 543 factors in our EWAS, but not all factors were present in all cohorts: 111 factors were measured in the 1999-2000 cohort, 146 in 2001-2002, 211 in 2003-2004, and 75 in 2005-2006. These comprised 266 unique environmental factors in total, with 157 factors measured in more than one cohort. Using NHANES categorization, we binned factors into 21 “class” groupings in order to discern patterns among related groups of factors, analogous to chromosomal units in GWAS (not shown). Different environmental factors were measured in varying numbers of participants, ranging from 507 to 3318 individuals.

[Figure 15, panels A-H: schematic of the analysis methodology. Recoverable panel content follows.]

A. Number of environmental factors per class in each NHANES cohort

Environmental Factor Class     1999-2000  2001-2002  2003-2004  2005-2006
Acrylamide                             0          0          2          0
Allergen Test                          0          0          0         20
Bacterial                              8         13         17          1
Cotinine                               1          1          1          1
Dialkyl                                7          7          6          0
Dioxins                                5          7          7          0
Furans                                 5          5          9          0
Heavy Metals                          18         18         23         25
Hydrocarbons                          14         22         21          0
Latex                                  1          0          0          0
Carotenoid Nutrients                   0          6         15          7
Mineral Nutrients                      2          2          2          1
Vitamin A                              3          3          3          3
Vitamin B                              4          4          5          3
Vitamin C                              0          0          1          1
Vitamin D                              0          1          1          1
Vitamin E                              2          2          3          2
Polychlorinated Biphenyls             23         26         38          0
Perchlorate                            0          0          2          0
Pesticides, Atrazine                   0          0          5          0
Pesticides, Carbamate                  0          0          1          0
Pesticides, Chlorophenol               0          0          1          1
Pesticides, Organochlorine            10         13         11          0
Pesticides, Organophosphate            2          2          2          0
Pesticides, Pyrethroid                 1          1          1          0
Phenols                               15         11          9         12
Phthalates                             7         12         12          0
Phytoestrogens                         6          6          6          0
Polybrominated Ethers                  0          0         12          0
Polyfluorochemicals                    0          0         10         12
Virus                                  6          6         10          6
Volatile Compounds                    29         14         22          0
Total                                169        182        258         96

B. Phenotypes: fasting glucose > 125 mg/dL? (N=109-3190; 8% of total); log10(triglycerides) (N=109-3618); log10(LDL) (N=101-3368); log10(HDL) (N=222-7485)

C. Regression of phenotype on zfactor (the transformed xfactor) and adjustment variables, yielding βfactor

D. Empirical FDR estimation: permute phenotype levels 1000x; FDR(α) ≤ 10%; P-value(βfactor) < α in 2 or more cohorts?

E. Combined cohort βfactor: compute estimate of validated factor using all cohorts

F. Estimation of R2

G. Sensitivity analyses: recompute βfactor, adjusted by self-report data: any metabolic health history; physical activity; any use of drugs (metformin, statins, etc.); total supplement use; 24- or 48-hour dietary recall (n=58)

H. Correlation globes of tentatively validated factors (ρ > 0.2)

Figure 15. A.) Summary of the 32 factor classes and the number of factors within them for each NHANES cohort. Each factor is measured in blood or urine. B.) 100-7,500 individuals had their fasting blood glucose (FBG), HDL-C, LDL-C, and triglyceride levels measured for each of these factors in each cohort; lipid levels were log transformed to approximate normality for least squares regression. Type 2 Diabetes status was assessed by considering those who had a FBG > 125 mg/dL. C.) Each of these 96 to 258 factors was tested for association with the logarithm of HDL-C, LDL-C, and triglyceride level with a linear regression model adjusted for age, age-squared, sex, BMI, ethnicity, and SES. To test against T2D status, a logistic regression model was utilized, adjusting for age, sex, ethnicity, SES, and BMI. D.) To account for multiple testing, we estimated the empirical null distribution by permuting the lipid levels and estimating the false-discovery rate (FDR). The p-value threshold (α) for statistical significance was determined by controlling the FDR to be under 10%. We deemed a factor to be tentatively validated if it was found to be significant in 2 or more cohorts with an effect in the same direction in all cohorts where it was significant. E.) For lipid level phenotypes, we estimated a final coefficient for tentatively validated factors by combining all cohorts and adjusting for age, age-squared, sex, ethnicity, SES, BMI, waist circumference, T2D status, blood pressure, and cohort. F.) We estimated the coefficient of determination (R2) for the final, combined models. G.) We re-computed our final models, adding 62 self-report variables one by one in an attempt to check the validity of the environmental effect. H.) We computed the pair-wise correlation between each of the tentatively validated factors along with other clinical co-variates and analyzed these relationships with correlation globes [10].

We omitted from our EWAS 73 factors that varied little across individuals in our sample. Specifically, we omitted those that had a majority (> 90%) of the observations below a detection limit threshold as defined in the NHANES codebook. We also removed factors that targeted a subset of the population, such as the test for Trichomonas vaginalis, an infectious pathogen found primarily in women.
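The detection-limit filter can be sketched as below. This is an illustration under assumptions: each factor is taken to have a single numeric detection limit, missing values are represented as None, and the function names and toy data are hypothetical. NHANES itself flags below-limit observations in the data files, which a real pipeline would likely use instead of a direct comparison.

```python
def below_limit_fraction(values, detection_limit):
    """Fraction of non-missing observations below the detection limit."""
    observed = [v for v in values if v is not None]
    if not observed:
        return 1.0  # nothing measured: treat the factor as uninformative
    return sum(v < detection_limit for v in observed) / len(observed)


def informative_factors(measurements, limits, max_fraction=0.90):
    """Keep factors whose below-limit fraction does not exceed max_fraction."""
    return [
        name
        for name, values in measurements.items()
        if below_limit_fraction(values, limits[name]) <= max_fraction
    ]


# Toy data, invented for illustration.
measurements = {
    "mostly_undetected": [0.01] * 19 + [0.5],      # 95% below the limit
    "well_detected": [0.2, 0.3, 0.01, 0.4, None],  # 25% below the limit
}
limits = {"mostly_undetected": 0.05, "well_detected": 0.05}
kept = informative_factors(measurements, limits)
```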

T2D cases were individuals who had a fasting blood glucose (FBG) level greater than or equal to 126 mg/dL, as advised by the American Diabetes Association (ADA) [204] (Figure 15B). We chose specificity and accuracy of diagnosis over sensitivity: we acknowledge that this definition ignores those who were previously diagnosed as diabetic but now keep their blood glucose under tight control; in fact, a larger proportion of NHANES respondents described themselves as diabetics or were taking medications often used to treat diabetes than were classified by FBG levels. Neither FBG nor the self-reported diabetes status distinguishes between Type 1 Diabetes (T1D) and T2D; as T2D has a prevalence rate more than 40 times higher than T1D, we assumed all our cases have T2D.

We used survey-weighted logistic regression to associate each of the 543 environmental attributes to diabetes status while adjusting for age, sex, body mass index (BMI), ethnicity, and an estimate for SES (Figure 15C). We acknowledge that estimating SES is difficult; nevertheless, we used the tertile of the poverty index, equivalent to the participant’s household income divided by the time-adjusted poverty threshold, as the estimate for SES. We used R with the survey package to conduct all survey-weighted analyses [205, 206].

Exposures were captured as either continuous or categorical variables. Most chemical exposure data arising from mass spectrometry or absorption measurements occurred within a very small range and had a right skew; thus, we log transformed these variables. Further, we applied a z-score transformation (centering each observation on the mean and scaling by the standard deviation) in order to compare odds ratios from the many regressions. Similarly, for categorical variables, we made the definition of the referent consistent, defining it to be the “negative” result of the test.
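The transformation of a continuous exposure can be sketched as follows. The function name and toy values are hypothetical, and the choice of base 10 and of the sample standard deviation are assumptions; the z-scores themselves do not depend on the base of the logarithm, since changing the base only rescales the logged values.

```python
import math
import statistics


def log_zscore(values):
    """Log10-transform positive exposure values, then center and scale to unit SD."""
    logged = [math.log10(v) for v in values]
    mean = statistics.fmean(logged)
    sd = statistics.stdev(logged)  # sample standard deviation (assumed)
    return [(x - mean) / sd for x in logged]


# Toy right-skewed exposure values, invented for illustration.
z = log_zscore([0.01, 0.02, 0.05, 0.10, 1.00])
```

With every factor on this scale, a regression coefficient describes the change in log odds per one standard deviation of logged exposure, which is what makes odds ratios comparable across factors.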

We calculated the false discovery rate (FDR), the estimated proportion of false discoveries among all discoveries made at a given significance level, to control for type I error due to multiple hypothesis testing in associating the factors to disease status [40]. To estimate the number of false discoveries, we created a “null distribution” of regression test statistics by shuffling the diabetes status labels 1000 times and recomputing the regressions. The FDR was then estimated as the ratio of the proportion of results called significant at a given level α in the null distribution to the proportion of results called significant in our real tests. To choose factors significantly associated with T2D in the first single-cohort phase, we used a significance level (α=0.02), which corresponded to an FDR of 10% across three out of four cohorts (1999-2000, 2003-2004, and 2005-2006) and 30% for the 2001-2002 cohort.
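The ratio described above can be sketched in a few lines, assuming the p-values from the real and the permuted regressions have already been computed. The function name, the toy p-values, and the cap at 1.0 are assumptions for illustration; the study used 1000 permutations rather than the two shown.

```python
def empirical_fdr(observed_pvalues, permuted_pvalues, alpha):
    """Estimate the FDR at threshold alpha.

    observed_pvalues: p-values from the real phenotype labels, one per factor.
    permuted_pvalues: list of p-value lists, one list per label permutation.
    """
    called = sum(p < alpha for p in observed_pvalues) / len(observed_pvalues)
    if called == 0:
        return None  # no discoveries, so the ratio is undefined
    null_total = sum(len(run) for run in permuted_pvalues)
    null_called = sum(p < alpha for run in permuted_pvalues for p in run) / null_total
    return min(1.0, null_called / called)


# Toy p-values, invented for illustration: 10 factors, 2 permutations.
observed = [0.001, 0.01, 0.015, 0.3, 0.6, 0.8, 0.2, 0.5, 0.9, 0.4]
permuted = [
    [0.01, 0.3, 0.7, 0.5, 0.9, 0.2, 0.6, 0.4, 0.8, 0.1],
    [0.25, 0.35, 0.75, 0.55, 0.95, 0.15, 0.65, 0.45, 0.85, 0.05],
]
fdr = empirical_fdr(observed, permuted, alpha=0.02)
```

In this toy case 30% of the real tests and 5% of the permuted tests fall below α, so the estimate is about 17%. Scanning α over a grid and taking the largest value whose estimate stays under a target is one way to arrive at a threshold such as α=0.02.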

To improve our power, we used the four independent cohorts to validate significant findings (Figure 15D). We considered a significant factor as “validated” if it was found to be significant (α=0.02) in more than one cohort, at the expense of having to drop those factors not measured in a second cohort. We then assessed the FDR of the multi-cohort validation. We first estimated the number of false positives by counting the number of factors found significant at a level α in two or more cohorts from the permuted datasets. We then estimated the FDR by computing the ratio between the number of false positives and the number of validated factors. This value was 2% with α equal to 0.02.
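The multi-cohort rule and its FDR can be sketched as follows. The function names and toy p-values are hypothetical, and averaging the false-positive count over permutations is an assumption about how the count from the permuted datasets is summarized.

```python
def validated(pvalues_by_factor, alpha=0.02, min_cohorts=2):
    """Factors with p < alpha in at least min_cohorts cohorts.

    pvalues_by_factor maps factor name -> {cohort: p-value}; a factor simply
    has no entry for cohorts in which it was not measured.
    """
    return sorted(
        name
        for name, by_cohort in pvalues_by_factor.items()
        if sum(p < alpha for p in by_cohort.values()) >= min_cohorts
    )


def multi_cohort_fdr(real, permuted_runs, alpha=0.02):
    """Mean count validated under permutation divided by the count validated in real data."""
    n_real = len(validated(real, alpha))
    if n_real == 0:
        return None
    false_pos = sum(len(validated(run, alpha)) for run in permuted_runs) / len(permuted_runs)
    return false_pos / n_real


# Toy p-values, invented for illustration.
real = {
    "f1": {"c1": 0.001, "c2": 0.01},
    "f2": {"c1": 0.01, "c2": 0.5},
    "f3": {"c1": 0.015, "c2": 0.019, "c3": 0.3},
    "f4": {"c1": 0.001},  # measured in one cohort only, so it cannot validate
}
null_run = {"f1": {"c1": 0.5, "c2": 0.6}, "f2": {"c1": 0.01, "c2": 0.7}}
lucky_run = {"f1": {"c1": 0.01, "c2": 0.015}, "f2": {"c1": 0.3, "c2": 0.9}}
fdr = multi_cohort_fdr(real, [null_run, lucky_run, null_run, null_run])
```

Requiring significance in two cohorts makes chance validation much rarer than chance significance in one cohort, which is why the multi-cohort FDR can be far lower than the single-cohort FDR at the same α.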

We fit a final logistic regression model with data combined from multiple NHANES cohorts utilizing all measurements for a specific environmental factor, attaining an overall odds ratio. The covariates of the final model were age, sex, BMI, ethnicity, SES, and cohort. We computed new sample weights for the combined datasets by taking the average of the original sample weights as described by the NHANES analytic guidelines [207].

We conducted 3 secondary analytic tests for the validity and sensitivity of our final estimates. We first attempted to check for reverse causality, or association of exposure due to T2D diagnosis. Our second test attempted to take into account the lipophilic characteristics of the environmental factors found. Our last test attempted to take into account recent food and supplement consumption as a potential bias for exposure measures. For adequate sample size and ease of comparison to the final fit model, we utilized all available data combining multiple NHANES cohorts as the sample to conduct these tests.


To attempt to account for one’s T2D diagnosis as a modifier of environmental exposure, known as “reverse causality”, we recomputed our models omitting those who had been diagnosed with diabetes. Individuals with a diabetes diagnosis were identified through yes answers submitted on a NHANES health questionnaire (“Doctor told you have diabetes?”). Thus, we refit our final models with individuals who were only at risk for T2D diagnosis.

Our second test attempted to account for the lipophilic chemical characteristics of our significant factors. Many of the environmental factors measured in NHANES are absorbed readily in fatty tissue; the presence of fatty tissue is also associated with T2D and is a potential confounder. Thus, we recomputed the models taking into account total triglycerides and cholesterol measured in the blood specimens of participants.

In our third test, we attempted to compare the dietary and supplement consumption of cases and controls, gathered from 24- and 48-hour recall and supplement use questionnaires, reasoning that recent intake may confound the exposure-disease association. The NHANES data contain amounts of food components consumed, based on the dietary recall available for all participants examined above. Specifically, amounts of food components are computed from the questionnaire using the United States Department of Agriculture (USDA) Food and Nutrient Database. Some of the vitamin and nutrient components included vitamin A, vitamin B-6, vitamin B-12, vitamin C, vitamin E, vitamin K, carotenes, lycopene, thiamin, riboflavin, niacin, folate, calcium, iron, magnesium, phosphorus, potassium, sodium, zinc, copper, and selenium. Other components included macronutrients, such as protein, carbohydrates, fat, fiber, and cholesterol. The total number of food components considered ranged from 51 to 63 across the different cohorts. Further, the 2003-2004 and 2005-2006 cohorts contained both 24- and 48-hour recall data. Supplement use included a count of consumption of vitamins, minerals, botanicals, and/or mixtures of them over the month prior to the survey. To check for possible confounding by recent consumption, we added each food and supplement variable to the logistic regression models specified above and re-evaluated the significance and effect size of the validated environmental factors. We coded food component content as the logarithm (base 10) of the amount entered. We coded supplement use as an integer count value. We acknowledge the potential for bias with the use of questionnaire data and a pre-determined database of food items but assumed it was a reliable proxy of consumption and behavioral data in lieu of other information.

EWAS on T2D: Results

Population characteristics

Across the cohorts, the total non-weighted and weighted numbers of those who were diabetic compared to non-diabetic were similar. However, we did see significant differences in demographic factors such as sex, age, and socioeconomic status between cases and controls. T2D occurred in higher age groups in all cohorts (p < 0.001, 2-sided t-test). There were significantly more male participants than females in all cohorts (p < 0.001, 0.02, 0.03, χ2 test) except for 2005-2006. Furthermore, there was a significant association between lowest SES (first tertile of poverty index) and T2D (p=0.006, 0.03, 0.04, logistic regression) for the 1999-2000, 2001-2002, and 2005-2006 cohorts, respectively. While we did not see a univariate association between ethnicity and T2D as diagnosed by FBG, we did confirm previously reported associations of ethnicity to T2D when stratifying by age and sex [208]. As expected, BMI was significantly associated with T2D status (p < 0.001, t-test) for all cohorts. Given these differences between the cases and controls, we adjusted our logistic regression models described below accordingly.


Environment Associations to T2D

Figure 16 shows the distribution of p-values of association for each environmental factor and class, adjusted for sex, age, BMI, ethnicity, and the estimate for SES, plotted in a “Manhattan plot” analogous to the association results from a GWAS. The 37 significant or notable factors are annotated in the figure. Specific categories show association with T2D, such as organochlorine pesticides, nutrients/vitamins, polychlorinated biphenyls, and dioxins (Figure 16), with between 10 and 30% of the factors in each class having p-values less than 0.02. Many positive (low p-value) and negative (high p-value) associations replicated well among the different cohorts.


Figure 16. “Manhattan plot” style graphic showing the environment-wide association with T2D. Y-axis indicates -log10(p-value) of the adjusted logistic regression coefficient for each of the environmental factors. Colors represent different environmental classes as represented in Figure 15A. Within each environmental class, factors are arranged left to right in order from lowest to highest odds ratio (OR). Plot symbols represent different cohorts: 1999-2000 (diamond), 2001-2002 (square), 2003-2004 (filled dot), 2005-2006 (circle). Red horizontal line is -log10(α)=1.7 (α=0.02). Validated factors significant in 2 or more NHANES cohorts are in bold face (α=0.02 in two or more cohorts, FDR of 2%) with larger plot points. Other significant factors (α=0.02) are annotated with a numeric label corresponding to the environmental factor class color key on the right. Figure abbreviations: Validated factors: t-β-carotene: trans β-carotene; c-β-carotene: cis β-carotene; PCB170: 2,2',3,3',4,4',5-Heptachlorobiphenyl. Group 1 (dioxins): 1-hxcdd: 1,2,3,6,7,8-Hexachlorodibenzo-p-dioxin; 2-hxcdd: 1,2,3,7,8,9-Hexachlorodibenzo-p-dioxin. Group 2 (furans): OCDF: 1,2,3,4,6,7,8,9-Octachlorodibenzofuran. Group 3 (heavy metals): Ur: uranium; Sb: antimony; Pb: lead. Group 4 (nutrients): tot-β-car: total β-carotene; α-car: alpha-carotene; retnl: retinol; Vita. D: vitamin D; δ-t: delta-tocopherol. Group 5 (organochlorine pesticides): DDE: dichlorodiphenyldichloroethylene. Group 6 (PCB): PCB169: 3,3',4,4',5,5'-Hexachlorobiphenyl; PCB138: 2,2',3,4,4',5'-Hexachlorobiphenyl; PCB195: 2,2',3,3',4,4',5,6-Octachlorobiphenyl; PCB183: 2,2',3,4,4',5',6-Heptachlorobiphenyl; PCB199: 2,2',3,3',4,5,5',6'-Octachlorobiphenyl; PCB178: 2,2',3,3',5,5',6-Heptachlorobiphenyl; PCB187: 2,2',3,4',5,5',6-Heptachlorobiphenyl; PCB180: 2,2',3,4,4',5,5'-Heptachlorobiphenyl; PCB146: 2,2',3,4',5,5'-Hexachlorobiphenyl; PCB196: 2,2',3,4,4',5,5',6-Octachlorobiphenyl. Group 7 (bacteria): H2: Herpes Simplex 2; HSBA: Hepatitis B Surface Antibody.
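The layout of such a plot can be sketched without any plotting library: factors are sorted by class and then by odds ratio, and each p-value is mapped to its height on the y-axis. The function name, dictionary keys, and toy results below are hypothetical and invented for illustration.

```python
import math


def manhattan_layout(results, alpha=0.02):
    """Order factors for plotting and flag those above the significance line.

    results: list of dicts with keys 'factor', 'klass', 'odds_ratio', 'p'.
    Returns (points, threshold); each point is (x, -log10(p), factor, significant).
    """
    threshold = -math.log10(alpha)
    ordered = sorted(results, key=lambda r: (r["klass"], r["odds_ratio"]))
    points = [
        (x, -math.log10(r["p"]), r["factor"], r["p"] < alpha)
        for x, r in enumerate(ordered)
    ]
    return points, threshold


# Toy single-cohort results, invented for illustration.
toy_results = [
    {"factor": "f_hi", "klass": "metals", "odds_ratio": 1.4, "p": 0.001},
    {"factor": "f_lo", "klass": "metals", "odds_ratio": 0.8, "p": 0.5},
    {"factor": "f_pcb", "klass": "pcb", "odds_ratio": 2.0, "p": 0.02},
]
points, threshold = manhattan_layout(toy_results)
```

The returned points could be passed to any plotting tool, with one marker style per cohort and a horizontal line drawn at the threshold.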

Table 7 shows those factors that were validated as being significant in two or more of the independent cohorts (multi-cohort validation FDR of 2%). Predicted probabilities of having T2D were computed for a prototype participant, a 45 year old white male with BMI of 27 (middle of the range for non-diabetics in the NHANES sample) and from the middle SES, at high and low exposure levels. For combined cohorts, the predicted probability applies to a prototype participant from the 2005-2006 cohort. We also computed the overall estimate by combining NHANES cohort data in a final model additionally adjusted for cohort; the predicted probabilities for these models were computed for a prototype participant as defined above. We defined low exposure as having a log transformed exposure level one standard deviation lower than the transformed mean, and high exposure as having a log transformed exposure level one standard deviation higher than the transformed mean. For example, a 45-year-old male from the 1999-2000 cohort with high levels (0.09 ng/g) of heptachlor epoxide has a 6% likelihood of being in our diabetes subset.
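The relationship between the odds ratio and the pair of predicted probabilities can be checked with a short calculation. This is a sketch under the logistic model, with all covariates held fixed at the prototype participant's values and a hypothetical function name. Because low and high exposure are one standard deviation below and above the mean, they are two standard deviations apart, so the odds change by the square of the per-SD odds ratio.

```python
def shift_probability(p_low, odds_ratio_per_sd, sd_shift=2.0):
    """Probability after raising exposure by sd_shift SDs, given the probability at the low level.

    Under a logistic model the odds multiply by odds_ratio_per_sd for each
    1-SD increase in the log-transformed, z-scored exposure.
    """
    odds_low = p_low / (1.0 - p_low)
    odds_high = odds_low * odds_ratio_per_sd ** sd_shift
    return odds_high / (1.0 + odds_high)


# Combined-cohort cis-beta-carotene row of Table 7: OR 0.6 per SD and a
# predicted probability of 0.15 at low exposure (both are rounded values).
p_high = shift_probability(0.15, 0.6)
```

Here p_high comes out close to 0.06, in line with the high-exposure probability reported for that row. Agreement will be looser for other rows because the published odds ratios and probabilities are rounded.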


Factor              Cohort       N† (T2D, No T2D)  P        OR (95% CI)      Factor Level (Lo-Hi)  Predicted Probability (Lo-Hi)
cis-β-carotene      2001-2002    211, 2852         0.01     0.6 (0.5, 0.8)   0.4-1.4 µg/dL         0.12-0.05
                    2003-2004    207, 2698         0.002    0.63 (0.5, 0.7)  0.4-1.9               0.13-0.06
                    2005-2006    186, 2425         0.02     0.6 (0.5, 0.8)   0.4-1.6               0.15-0.06
                    2001-2006*   604, 7975         < 0.001  0.6 (0.5, 0.7)   0.4-1.7               0.15-0.06
trans-β-carotene    2001-2002    211, 2854         0.01     0.6 (0.5, 0.8)   5.1-27.2 µg/dL        0.13-0.05
                    2003-2004    207, 2698         0.002    0.7 (0.6, 0.8)   4.8-24.7              0.13-0.06
                    2005-2006    203, 2701         0.004    0.6 (0.4, 0.7)   4.8-29.0              0.16-0.06
                    2001-2006*   621, 8253         < 0.001  0.6 (0.5, 0.7)   4.9-27.0              0.15-0.06
γ-tocopherol        1999-2000    146, 2091         0.02     1.8 (1.3, 2.4)   107-360 µg/dL         0.03-0.09
                    2003-2004    207, 2698         0.01     1.6 (1.3, 2.0)   103-356               0.06-0.13
                    1999-2006*   767, 10307        < 0.001  1.5 (1.3, 1.7)   107-352               0.06-0.13
Heptachlor Epoxide  1999-2000    46, 635           0.002    3.2 (2.4, 4.4)   0.02-0.09 ng/g        0.01-0.06
                    2003-2004    67, 809           0.01     1.9 (1.3, 2.6)   0.01-0.07             0.02-0.07
                    1999-2004*   178, 2367         < 0.001  1.7 (1.3, 2.1)   0.02-0.08             0.03-0.07
PCB170              1999-2000    45, 716           0.02     2.3 (1.5, 3.6)   0.03-0.12 ng/g        0.01-0.06
                    2003-2004    53, 773           0.01     4.5 (2.1, 9.9)   0.01-0.12             0.03-0.42
                    1999-2004*   165, 2426         < 0.001  2.2 (1.6, 3.2)   0.02-0.13             0.04-0.15

Table 7. Highly statistically significant environmental factors associated to T2D found in more than one NHANES cohort. Odds ratio for each exposure, adjusted for BMI, age, sex, ethnicity, and SES is calculated for a change in the log exposure level by one standard deviation. Factor level is the amount of exposure defined by the low (1 SD lower than the average logged exposure level) and high range (1 SD higher than the average logged exposure level). The predicted probability range is an estimate for a 45-year-old white male with BMI of 27 kg/m2 from the middle SES to develop the disease in the low to high range of exposure.
* denotes analysis using combined NHANES cohorts; models adjusted for age, sex, ethnicity, BMI, SES, and cohort; predicted probabilities for combined cohorts apply to an individual from the 2005-2006 cohort. † denotes unweighted number.

Nutrients and Vitamins: Carotenes and γ-tocopherol

Several vitamins were found to have levels inversely associated with T2D. The first type included an antioxidant in the isoforms of β-carotene (final adjusted OR of 0.6; 95% CI 0.5-0.7; p < 0.001). For the prototypical participant, high levels of trans or cis β-carotene equated to a predicted risk of T2D status 9 percentage points lower (6% vs. 15%). We were able to confirm the inverse association of β-carotenes seen in multiple epidemiological studies in Saudi Arabia [209], among older people [210], among Swedish men [211], and in an earlier NHANES III cohort (pre-1999) [212], as well as another small study that showed an inverse response between fasting glucose level and β-carotene [213]. However, in a prospective case-control study, β-carotene was not significantly inversely associated to T2D [214]. Because T2D is associated with reduced anti-oxidant defense, anti-oxidants, such as carotenes, have been occasionally recommended as a therapy [215]. However, the evidence of mitigation of T2D with these vitamins as therapies has been negligible in clinical trials, including those in women who are at high risk of cardiovascular disease [216] or in male smokers [217].

We also discovered a vitamin associated with increased risk for T2D. Surprisingly, γ-tocopherol, a form of vitamin E, was highly significantly and positively associated with T2D (final adjusted OR 1.5; 95% CI 1.3, 1.7; p < 0.001) in two cohorts (adjusted OR of 1.8 and 1.6; p=0.02 and 0.01 for the 1999-2000 and 2003-2004 cohorts) and nearly significant in the two others (adjusted OR of 1.3 and 1.6; p=0.06 and 0.04 for the 2001-2002 and 2005-2006 cohorts). For the prototypical participant, low levels of γ-tocopherol equated to a predicted risk 7 percentage points lower (6% vs. 13%). To our knowledge, this is a novel association between γ-tocopherol and T2D.

Persistent Pollutants: Polychlorinated Biphenyls and Organochlorine Pesticides

We found organochlorine pesticides and polychlorinated biphenyls (PCBs), both related pollutant factors, to be highly positively associated with T2D. Among the PCBs, we specifically discovered PCB170 (2,2',3,3',4,4',5-Heptachlorobiphenyl; final adjusted OR of 2.2; 95% CI 1.6-3.2; p < 0.001). The effect sizes in the individual cohorts for PCB170 were large (adjusted OR 2.3 and 4.5; p = 0.02 and 0.01 for the 1999-2000 and 2003-2004 cohorts). The models predicted up to 15% T2D risk for the prototype participant, more than double the risk of those with low concentrations of PCB170. The association between the class of PCBs and T2D has been well described within Native American [218], Japanese [219], Swedish [220], and Taiwanese [221] cohorts.

Heptachlor epoxide, an oxidation product of the organochlorine pesticide heptachlor, was among the most highly associated factors (final adjusted OR 1.7; 95% CI 1.3-2.1; p < 0.001) in our EWAS. The effect sizes in the individual cohorts were also large (adjusted OR 3.2 and 1.9; p=0.002 and 0.01 for the 1999-2000 and 2003-2004 cohorts). The predicted probability for the prototypical participant with high levels of the pollutant was 7%, more than 2 times greater than for those who had lower levels of this pollutant.

Secondary analysis to test validity of the final estimates

We then attempted to test the validity of our final estimates by conducting 3 additional analytic tests. In the first test, we attempted to consider the possibility of “reverse causality” or differential exposure status due to T2D diagnosis. Second, we attempted to assess the effect of potential confounding bias due to the lipophilic characteristics on our final environmental factor effect estimates. Third, we attempted to assess the effect of recent nutrient and supplement consumption on our final effect estimates.

To consider T2D diagnosis as a modulator of exposure, we removed all individuals who answered yes when questioned about a past history of diabetes in the NHANES health questionnaire (“Doctor told you have diabetes?”). Thus, T2D cases were those who had a FBG higher than 125 mg/dL and were at risk for T2D diagnosis. We recomputed the effect of exposure, adjusted for age, sex, SES, ethnicity, BMI, and cohort. For all validated factors significant in 2 or more cohorts above (Table 7), the estimates remained stable and statistically significant. The effect size for Heptachlor Epoxide was marginally smaller with an adjusted OR of 1.6 (95% CI 1.1, 2.1; p = 0.008). The adjusted OR for PCB170 was also marginally smaller, 2.1 (95% CI 1.2, 3.9; p = 0.02). The effect of γ-tocopherol was larger, with an adjusted OR of 1.8 (95% CI 1.3, 2.2; p < 0.001), and there was no change to the effect sizes of the carotenes (adjusted OR 0.6; 95% CI 0.5, 0.7; p < 0.001). We concluded that there was not enough evidence to support the phenomenon of reverse causality based on the effect sizes estimated for those who were at risk for T2D.

We next attempted to account for potential confounding bias of lipid levels. To assess the degree of possible confounding, we refit the logistic regression adjusting for the logarithm (base 10) of total triglyceride and cholesterol levels in addition to age, sex, BMI, SES, ethnicity, and cohort. We did not observe a great change in effect size estimates for the environmental factors after this further adjustment for total triglycerides and cholesterol. The odds ratio after adjusting for lipid levels for carotenes was 14% higher, 0.7 (95% CI 0.6, 0.8; p < 0.001) compared to 0.6. Similarly, the odds ratio for γ-tocopherol was attenuated by 7%, 1.4 (95% CI 1.2, 1.6; p < 0.001) compared to 1.5 (Table 7). For the pesticide factor, the odds ratio was smaller by 6%, 1.6 (95% CI 1.3, 2.0; p < 0.001) versus 1.7 (Table 7). Lastly, for the PCB factor, we observed a 3% higher odds ratio of 2.3 (95% CI 1.4, 3.7; p = 0.002) versus 2.2. Consistent with this secondary analysis, we observed a similar degree of effect size differences when using the “Lipid Adjusted” NHANES environmental factors, which are only provided for a few of the pollutant factors (not shown). We concluded that the effect sizes of the environmental factors were affected by lipid levels, but not substantially biased by them.

We then searched for differences in food and supplement consumption patterns between diabetics and non-diabetics for all 4 cohorts close to the time of survey, derived from dietary recall and supplement use questionnaires. In comparing dietary nutrients, we did not observe a difference between cases and controls for any dietary nutrient except one. The exception was a lower total carbohydrate intake for diabetics versus controls, consistent with many diabetics knowing about their disease; specifically, the adjusted OR was 0.7 (95% CI 0.6, 0.8; p=0.001) for a 10% increase in total carbohydrate consumption, adjusted for sex, age, ethnicity, SES, and cohort. We also observed an inverse association between any supplement use and T2D, with an adjusted OR of 0.6 (95% CI 0.5, 0.8; p < 0.001), also consistent with our expectation of increased health awareness for those with T2D. However, we specifically could find no difference in consumption of carotenes or tocopherol (p=0.85 and 0.2, respectively) between cases and controls, two of the validated nutrient factors found in our EWAS (Table 7).

Having observed some difference in consumption behavior between cases and controls, we then attempted to assess the influence of recalled dietary consumption on the environmental associations by recomputing the logistic regression models in the presence of dietary and supplement use variables. Adding the new dietary or supplementary vitamin consumption variables did not attenuate the odds ratios (maximum change of 1-2%), nor did it lessen the strength of the associations for any of the 5 validated environmental factors described in Table 7. Thus, we did not have evidence that recent consumption influenced the factor-disease effect sizes for the validated factors found in our EWAS.

We took a further step in assessing the strength of the environmental associations, adjusting for total triglycerides and cholesterol, any supplement use, and food intake simultaneously. Specifically, the odds ratio for a 1 SD increase in γ-tocopherol levels was 1.3 (95% CI 1.1, 1.5; p=0.004) when adjusting for the logarithm (base 10) of triglycerides, cholesterol, total vitamin E consumption, beta carotene consumption, total carbohydrate consumption, and any supplement use along with age, sex, ethnicity, BMI, and SES. The analogous models for cis- and trans-β-carotene resulted in an adjusted OR of 0.7 (95% CI 0.6, 0.8; p < 0.001). Odds ratios were consistently high and significant for the pollutant factors heptachlor epoxide and PCB170 after analogous adjustment for recent consumption and total lipid levels, with odds ratios of 1.6 (95% CI 1.3, 2.1; p < 0.001) and 2.2 (95% CI 1.4, 3.5; p=0.003), respectively. We concluded that recent consumption as encoded by the dietary recall questionnaire, in conjunction with lipid levels, did not alter the validity of the associations of the 5 environmental factors found.

To summarize our secondary tests for validity, we concluded that reverse causality, recent food and supplement consumption, and total lipid levels did not substantially bias our effect estimates for the 5 validated factors. These tests were made possible by the extensive list of co-variates available in the NHANES.

EWAS on T2D: Conclusions

We have described a prototype Environment-Wide Association Study (EWAS) and applied it to the study of Type 2 Diabetes (T2D), validated many of our significant findings across independent cohorts, and confirmed some of them through the literature. This study was made possible by the examination of multiple cohorts present in the nationally representative NHANES dataset. We have rediscovered factors such as carotenes and PCBs with previously known association to T2D. Unexpectedly, we found that higher levels of γ-tocopherol were associated with higher likelihood of T2D, independent of dietary intake. Of the components of Vitamin E, γ-tocopherol is the most abundant form in the US diet [222], and makes up to 50% of the total vitamin E in human muscle and adipose tissue [223], two known insulin-target tissues. As γ-tocopherol has been previously suggested as a preventive agent against colon cancer [224], any potential adverse metabolic effects for this vitamin should be studied closely.

Another novel finding was in the significant association between heptachlor epoxide levels and T2D. Heptachlor is a pesticide; most uses of heptachlor were discontinued in the late 1980s [225]. The main source of heptachlor and its breakdown product, heptachlor epoxide, is from food, but heptachlor epoxide is persistent in the environment and can even be passed in breast milk [226]. While a significant association with T2D has been reported across thirty-thousand pesticide applicators who used the pesticide heptachlor [227], to our knowledge, this broad association between heptachlor epoxide and T2D in the general public, as surveyed by NHANES, is novel.

While GWAS has allowed us to find novel variants associated with T2D of possible mechanistic importance and provided a model for a comprehensive study of the environment described here, associated variants have had only moderate effect sizes to date. Most of the risk loci identified with GWAS have small individual odds ratios, generally less than 1.3 [164, 202, 228] and the highest has been reported to be 1.71, belonging to a variant in the TCF7L2 gene [163, 165]. Albeit from different populations and analytical scenarios, the effect sizes of our validated environmental factors on T2D were comparable to the highest odds ratios seen in GWAS.

ENVIRONMENT-WIDE ASSOCIATION STUDY ON SERUM LIPID LEVELS

Serum lipid levels correlate with the risk of coronary heart disease (CHD), atherosclerosis, stroke, and even the disease described above, type 2 diabetes (T2D). Both genetic and environmental factors influence lipid phenotypes. Lipid level variation is 20-70% heritable [229, 230], while well-documented environmental or lifestyle factors include physical exercise, smoking, and diet [231-235]. Other less tangible factors, such as air pollution, may also be important [236]. Here, we have applied the EWAS paradigm (extended from the approach above) to evaluate 332 environmental attributes for their association with triglycerides, high-density lipoprotein cholesterol (HDL-C), and low-density lipoprotein cholesterol (LDL-C).

EWAS on Serum Lipids: Methods

Data

Laboratory data analyzed included serum and urine measures of environmental factors and clinical measures including lipid levels. We analyzed all factors that were a direct measurement of an extrinsic environmental attribute (e.g. amount of pesticide or heavy metal in urine or blood) as described earlier. Of 824 potentially eligible variables across all cohorts, we omitted 119 that varied little across individuals (continuous variables with > 99% of the observations below the detection limit threshold and binary variables with > 99% of observations in either the “negative” or “positive” class). Of the 705 remaining variables, 169 were measured in the 1999-2000 cohort, 182 in 2001-2002, 258 in 2003-2004, and 96 in 2005-2006. Cumulatively, they comprised 332 unique environmental factors, with 207 factors measured in >1 cohort. We binned these factors into 32 “classes” of related factors, analogous to chromosomal units in GWAS (Figure 15A) as described earlier.
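The variable filter described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the code used in this work: the function name `varies_too_little`, the encoding of a below-detection-limit observation as a boolean flag, and the handling of empty variables are all hypothetical choices made for the example.

```python
def varies_too_little(values, kind, threshold=0.99):
    """Return True if a variable should be omitted for lack of variation.

    values: for kind="continuous", booleans that are True when an observation
            fell below the detection limit (hypothetical encoding);
            for kind="binary", 0/1 class labels.
    threshold: fraction above which the variable is dropped (0.99 in the text).
    """
    n = len(values)
    if n == 0:
        return True  # assumption: a variable with no observations is dropped
    if kind == "continuous":
        frac_below = sum(1 for v in values if v) / n
        return frac_below > threshold
    if kind == "binary":
        frac_positive = sum(1 for v in values if v == 1) / n
        # Drop if either the "positive" or the "negative" class dominates.
        return max(frac_positive, 1.0 - frac_positive) > threshold
    raise ValueError("kind must be 'continuous' or 'binary'")
```

Applied to every candidate variable, a filter of this shape would leave only the variables that vary enough across individuals to be tested.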

Different environmental factors were measured in varying numbers of participants: 109-3610 (median 938), 101-3388 (median 896), and 222-7485 (median 1958) individuals for triglyceride, LDL-C, and HDL-C levels respectively (Figure 15B). Individuals are selected randomly to have these measurements and the selection procedure is dependent on their demographic characteristics due to the complex stratified population sampling of NHANES [237]. Serum triglyceride levels were measured in the morning after >8.5 hours’ fasting. LDL-C levels were derived from total cholesterol and direct HDL-C measurements using the Friedewald calculation.
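For reference, the Friedewald calculation estimates LDL-C in mg/dL as total cholesterol minus HDL-C minus one fifth of triglycerides. The sketch below is illustrative only; the function name is hypothetical, and the 400 mg/dL triglyceride cutoff is the commonly cited validity limit of the formula, included here as an assumption rather than as a documented detail of this analysis.

```python
def friedewald_ldl(total_cholesterol, hdl_c, triglycerides):
    """Estimate LDL-C (mg/dL) as TC - HDL-C - TG/5.

    All inputs are in mg/dL. The estimate is commonly considered unreliable
    for triglycerides above 400 mg/dL (assumed cutoff for this sketch).
    """
    if triglycerides > 400:
        raise ValueError("Friedewald estimate not reliable for TG > 400 mg/dL")
    return total_cholesterol - hdl_c - triglycerides / 5.0


# Hypothetical participant: TC 200, HDL-C 50, TG 150 (all mg/dL).
ldl_example = friedewald_ldl(200, 50, 150)
```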

Statistical analysis

The systematic EWAS analysis encompasses multiple steps (Figure 15 C-H) as described earlier. First, survey-weighted linear regressions were performed for each environmental factor against log10-transformed lipid levels, adjusting for age, age-squared, sex, body mass index (BMI), ethnicity, and socioeconomic status (SES) (Figure 15C). For SES we used the tertile of poverty index (participant’s household income divided by the time-adjusted poverty threshold), as previously described. Ethnicity was coded in 5 groups (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other). We used the R survey package for all survey-weighted analyses [205, 206].
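To illustrate the idea of a weighted regression on log10-transformed lipid levels, the following minimal sketch fits a single-predictor weighted least squares line in plain Python. It is a simplification under stated assumptions: it omits the adjusting covariates and the design-based variance estimation that a survey package provides, and the function name, data, and weights are hypothetical.

```python
import math


def weighted_linear_fit(x, y, w):
    """Weighted least squares fit of y = intercept + slope * x.

    w holds one non-negative sampling weight per observation.
    Returns (intercept, slope). Point estimates only; no standard errors.
    """
    sw = sum(w)
    xbar = sum(wi * xi for wi, xi in zip(w, x)) / sw
    ybar = sum(wi * yi for wi, yi in zip(w, y)) / sw
    sxx = sum(wi * (xi - xbar) ** 2 for wi, xi in zip(w, x))
    sxy = sum(wi * (xi - xbar) * (yi - ybar) for wi, xi, yi in zip(w, x, y))
    slope = sxy / sxx
    return ybar - slope * xbar, slope


# Hypothetical data: exposure in SD units, lipid level in mg/dL, survey weights.
exposure = [0.0, 1.0, 2.0, 3.0]
lipid_mg_dl = [100.0, 110.0, 121.0, 133.1]
weights = [1500.0, 800.0, 2200.0, 1200.0]

log_lipid = [math.log10(v) for v in lipid_mg_dl]
intercept, slope = weighted_linear_fit(exposure, log_lipid, weights)
```

In this toy data the lipid level rises by 10% per unit of exposure, so the fitted slope equals log10(1.1) whatever the weights are; with real data the weights shift the estimate toward the more heavily weighted participants.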

We calculated the false discovery rate (FDR), the estimated proportion of false discoveries among the total number of discoveries made at a given significance level α, to control for multiple hypothesis testing (Figure 15D) [40, 238]. We created a “null distribution” of regression test statistics for each cohort separately, shuffling the triglyceride, HDL-C, and LDL-C levels 1000 times and refitting the linear regression models. The FDR is the ratio of the number of results called significant at a given level α in the null distribution to the number of results called significant in our real tests. We used FDR<10% to select significant associations. This corresponds to α=0.02 for triglycerides, 0.02 for HDL-C, and 0.01 for LDL-C.
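The permutation-based FDR estimate can be sketched as below. This is a hedged illustration, not the code used here: it assumes the null count is averaged over permutations before being divided by the number of real discoveries, and the function name and inputs are hypothetical.

```python
def empirical_fdr(real_pvalues, null_pvalues_by_permutation, alpha):
    """Estimate the FDR at significance level alpha from permuted results.

    real_pvalues: p-values from the regressions on the observed data.
    null_pvalues_by_permutation: one list of p-values per shuffle of the
        lipid levels (1000 shuffles in the text).
    """
    n_real = sum(1 for p in real_pvalues if p <= alpha)
    if n_real == 0:
        return 0.0  # convention for this sketch: no discoveries, no false ones
    n_perm = len(null_pvalues_by_permutation)
    n_null = sum(
        1 for perm in null_pvalues_by_permutation for p in perm if p <= alpha
    )
    expected_false = n_null / n_perm  # average null "discoveries" per shuffle
    return min(1.0, expected_false / n_real)
```

Scanning a grid of α values with a function like this and keeping the largest α whose estimate stays below 0.10 would be one way to arrive at per-lipid thresholds of the kind reported above.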

Next, we used the four independent cohorts to validate significant findings (Figure 15D). We considered a significant factor as “tentatively validated” if it was significant (FDR<10%) in more than one cohort and with the same direction of effect in all cohorts. Of the 332 factors, 125 were assessed in only 1 cohort and thus they could not be considered validated; 73 factors were assessed in 2 cohorts, 102 were assessed in 3 cohorts, and 32 were assessed in all 4 cohorts. We assessed the FDR of the multi-cohort validation empirically through permutations, as described in the previous chapter. Briefly, we first estimated the number of false positives by counting the number of factors found significant at level α in multiple cohorts from the permuted analyses. For example, to assess the FDR of validating a factor in 2 cohorts, we collected the factors that fell below the significance threshold α in the permuted data corresponding to two different cohorts and counted the number of factors found significant in both. We repeated this operation on all possible pairs (n=6) of cohorts, adding up numbers found to be significant in each pair. We then estimated the FDR by computing the ratio between the total number of false positives and the number of true validated factors. We repeated the analogous operation for factors significant in 3 and 4 cohorts. For triglyceride levels, FDR=0.008, 0.0003, and 5 × 10⁻⁵ for results validated in 2, 3 and 4 cohorts, respectively. For HDL-C, the respective FDR is 0.01, 0.0002, and 5 × 10⁻⁵. For LDL-C, the respective FDR is 0.009, 3 × 10⁻⁵, and < 10⁻⁹.
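The counting step for the multi-cohort validation FDR can be sketched as follows. The sketch assumes that false positives are averaged over permutations and that a factor significant in several permuted cohorts is counted once per cohort combination, as the pairwise description above suggests; the function name and data layout are hypothetical.

```python
from itertools import combinations


def validation_fdr(permuted_significant, n_validated, k=2):
    """Estimate the FDR of validating a factor in k cohorts.

    permuted_significant: one dict per permutation, mapping a cohort label to
        the set of factors that fell below alpha in that permuted cohort.
    n_validated: number of factors validated in k cohorts in the real data.
    k: number of cohorts a factor must be significant in (2, 3, or 4).
    """
    if n_validated == 0:
        return 0.0  # convention for this sketch
    total = 0
    for perm in permuted_significant:
        # All combinations of k cohorts; 4 cohorts give 6 pairs when k=2.
        for cohorts in combinations(sorted(perm), k):
            shared = set.intersection(*(perm[c] for c in cohorts))
            total += len(shared)
    expected_false = total / len(permuted_significant)
    return expected_false / n_validated
```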

We then fit a final linear regression model with data combined from multiple NHANES cohorts utilizing all measurements available for a tentatively validated environmental factor, attaining an overall estimate and p-value (Figure 15E). We utilized the larger sample size to adjust for additional quantitative factors that we were unable to adjust for in the single cohort analyses (due to small residual degrees of freedom). In addition to the initial covariates, we also adjusted for waist circumference, T2D status (as defined in the previous section, fasting blood glucose ≥ 126 mg/dL), systolic and diastolic blood pressure (mm Hg), and cohort. To estimate how much of the variance was described by each environmental factor, we estimated the change in the coefficient of determination (R²) when adding that factor to a model including only the adjusting factors (Figure 15F). We also performed regressions on untransformed lipid levels to estimate raw effect sizes.
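The change in the coefficient of determination can be illustrated with the short sketch below. It assumes fitted values from the baseline and the extended model are already available, and it computes an unweighted R²; the survey weighting used in the actual analyses is omitted, and the function names are hypothetical.

```python
def r_squared(observed, fitted):
    """Unweighted coefficient of determination: 1 - SS_res / SS_tot."""
    mean = sum(observed) / len(observed)
    ss_tot = sum((o - mean) ** 2 for o in observed)
    ss_res = sum((o - f) ** 2 for o, f in zip(observed, fitted))
    return 1.0 - ss_res / ss_tot


def delta_r_squared(observed, fitted_baseline, fitted_with_factor):
    """Variance additionally described once the environmental factor is added."""
    return r_squared(observed, fitted_with_factor) - r_squared(
        observed, fitted_baseline
    )
```

Because the extended model nests the baseline model, the difference is non-negative for ordinary least squares fits on the same observations.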

Sensitivity analyses

We conducted sensitivity analyses to account for recent food, alcohol, supplements, medications, exercise, and history of cardiovascular health (Figure 15G). Sixty-two questionnaire items were used. For adequate sample size and consistency with the final-fit model, we combined all available NHANES cohorts.

Intake variables (total calories, carbohydrates, saturated fat, monounsaturated fat, alcohol, cholesterol, vitamins, and iron) were computed from the questionnaire using the USDA Food and Nutrient Database. All cohorts contained 24-hour recall data. For the 2003-2004 and 2005-2006 cohorts we computed an average of 24- and 48-hour recall data. Supplement use was the integer count of vitamins, minerals, botanicals, and/or their mixtures consumed in the month prior to the survey. Consumption of any fish or shellfish during the last month was also considered.

For individuals with abnormal levels of lipids, drug therapies such as statins, resins, and fibrates are often prescribed [239]. Therefore, we sought to adjust for any use of these medications. Drug use definition required that the individual used the drug during the month prior to the survey and the interviewer saw the prescription bottle.


Physical activity also influences lipid levels. We therefore classified individuals into high, medium, or light intensity weekly exercise categories by computing metabolic equivalents of recalled activity levels [240, 241], including components such as leisure time, occupational, and household routine-related activity.

Finally, recalled cardiovascular health history was based on a positive response to questions on the presence of coronary heart disease, angina/angina pectoris, heart attack(s), or congestive heart failure.

To evaluate the impact of these 62 adjusting variables, we recomputed the regression models by adding each variable to our final model one-by-one and observed the change in the effect size for each putative environmental factor. We also built a model adjusting for lipid-lowering drugs, supplement use, exercise, and self-report cardiovascular-related disease simultaneously.
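The change in effect size used in these sensitivity analyses is the quantity 100*(extended-original)/original shown on the axes of Figures 21-23. A minimal sketch follows; the helper names are hypothetical, and using the absolute value with the 10% cutoff mentioned in the Figure 21 caption is an assumption of this example.

```python
def percent_change(original, extended):
    """Percent change in a coefficient: 100 * (extended - original) / original."""
    return 100.0 * (extended - original) / original


def flag_sensitive(original, extended_by_covariate, threshold=10.0):
    """Return covariates whose addition moved the coefficient by > threshold %.

    extended_by_covariate: maps the name of an added questionnaire variable to
        the environmental factor's coefficient in the extended model.
    """
    flagged = {}
    for name, beta in extended_by_covariate.items():
        change = percent_change(original, beta)
        if abs(change) > threshold:
            flagged[name] = change
    return flagged
```

For a hypothetical factor with an original coefficient of 0.10, an extended coefficient of 0.12 after adding supplement count would be flagged as a change of about 20%, while 0.101 after adding fish consumption would not.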

Correlation pattern between factors associated with lipid levels

Identified factors that are associated with lipid levels may not be independent. Therefore, we also computed all Pearson correlation coefficients between each of the validated environmental factors as well as the demographic (age, sex, ethnicity, SES), and clinical (BMI, waist circumference, blood pressure, and diabetes status) risk factors to ascertain the pattern of relationships among them (Figure 15H). Next, we visualized all of these correlations as a “correlation globe” to infer their inter-dependence as a function of all variables examined. This approach has been utilized to discover inter-related or dependent sets of genes in a gene expression microarray experiment [176, 177].
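Computing all pairwise Pearson correlations can be sketched as below. The example is unweighted and uses hypothetical variable names and values; it only illustrates the pairwise computation that would feed a visualization such as the correlation globe.

```python
import math
from itertools import combinations


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def pairwise_correlations(variables):
    """Map each unordered pair of variable names to its Pearson correlation."""
    return {
        (a, b): pearson(variables[a], variables[b])
        for a, b in combinations(sorted(variables), 2)
    }


# Hypothetical measurements for four participants.
example = {
    "bmi": [20.0, 25.0, 30.0, 35.0],
    "waist": [70.0, 80.0, 90.0, 100.0],
    "hdl": [60.0, 55.0, 50.0, 45.0],
}
correlations = pairwise_correlations(example)
```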

Power calculations

We estimated [242] that the EWAS had >90% median power for all cohorts for detection of a 5% change in HDL-C (p<0.02) and LDL-C (p<0.01) and a 10% change in triglyceride levels (p<0.02).

EWAS on Serum Lipids: Results

Demographic and baseline associations with lipid levels

As expected [243], demographics, BMI, ethnicity, and SES correlated with lipid levels. For example, consistent positive correlations existed between age and triglycerides (5-10% higher per 10 years, p-values<0.02), and BMI and triglycerides (2% higher per 1 unit of BMI, p-values<0.004), and consistent negative correlations between black ethnicity and triglycerides (13% lower vs. white, p-values<0.001) [244]. Consistent polynomial relationships existed between age and both HDL-C and LDL-C. Negative correlations existed between BMI and HDL-C (1% lower per BMI unit, p-values<0.0001). In addition, SES was associated with HDL-C (1-5% lower for lower vs. higher tertile, p-values<0.03).
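Because lipid levels were log10-transformed before regression, a coefficient β corresponds to a multiplicative change of 10^β in the lipid level per unit of the covariate. The sketch below shows one common way to express this as a percent change; whether exactly this conversion produced the percentages reported here is an assumption, and the function name and example coefficients are hypothetical.

```python
import math


def percent_change_from_log10_beta(beta, delta=1.0):
    """Percent change in the outcome for a change of `delta` in the covariate,
    given a coefficient `beta` from a regression on the log10 outcome."""
    return (10.0 ** (beta * delta) - 1.0) * 100.0


# Hypothetical coefficients chosen to reproduce round percentages.
beta_bmi = math.log10(1.02)          # per 1 unit of BMI
beta_age = math.log10(1.05) / 10.0   # per 1 year of age

per_bmi_unit = percent_change_from_log10_beta(beta_bmi)             # about 2%
per_decade = percent_change_from_log10_beta(beta_age, delta=10.0)   # about 5%
```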

Environment associations with lipid levels

Figure 17 shows the distribution of p-values of association for each environmental factor binned by its class, adjusted for sex, age, age-squared, BMI, ethnicity, and SES. For triglyceride levels, 10/169, 24/182, 49/258, and 12/96 factors passed the threshold of significance for the 1999-2000, 2001-2002, 2003-2004, and 2005-2006 cohorts respectively. Likewise for LDL-C, 1/169, 8/182, 15/258, and 11/96 were significant, respectively. For HDL-C, 2/169, 21/182, 39/258, and 15/96 were significant. Using other cohorts, we tentatively validated significant findings from our initial screen. Across cohorts, there were 22, 8, and 17 tentatively validated factors for triglycerides, LDL-C and HDL-C, respectively (Figure 17 A-C).

Data were combined across cohorts for each tentatively validated factor and estimates were further adjusted for waist circumference, T2D status, blood pressure, and cohort. The variance ascribed to baseline co-variates was 22-25% (triglycerides), 15-16% (LDL-C), and 23-26% (HDL-C). Each of the tentatively validated environmental factors described an additional 0.7-18.4% (triglycerides), 1.8-14.1% (LDL-C), and 0.4-4.0% (HDL-C) of the variance in lipid levels.

[Figure 17 graphic. Panels: A) triglycerides, B) LDL-C, C) HDL-C. Y-axis: -log10(p-value), from 0 to 4. X-axis: environmental factor classes (latex, PCBs, dialkyl, dioxins, cotinine, phenols, phthalates, acrylamide, perchlorate, allergen test, viral infection, heavy metals, hydrocarbons, phytoestrogens, bacterial infection, nutrients minerals, nutrients vitamin A, nutrients vitamin B, nutrients vitamin C, nutrients vitamin D, nutrients vitamin E, nutrients carotenoid, pesticides atrazine, pesticides carbamate, pesticides pyrethroid, pesticides chlorophenol, pesticides organochlorine, pesticides organophosphate, volatile compounds, polyfluorochemicals, furans dibenzofuran, polybrominated ethers). Cohort markers: 1999-2000, 2001-2002, 2003-2004, 2005-2006.]

Figure 17. “Manhattan plot” style graphic showing the environment-wide associations to lipid levels. Y-axis indicates -log10(p-value) of the adjusted linear regression coefficient for each of the environmental factors. Colors represent different environmental classes as represented in Figure 15. Plot symbols represent different cohorts: 1999-2000 (diamonds), 2001-2002 (squares), 2003-2004 (filled dots), 2005-2006 (circles). Red horizontal line represents the level of significance corresponding to FDR less than 10%. A) log10(triglycerides), B) log10(LDL-C), C) log10(HDL-C).

Effects for the top tentatively validated associations for triglycerides, LDL-C, and HDL-C are shown in Figure 18, Figure 19, and Figure 20. Although we found 22 and 17 factors for triglycerides and HDL-C respectively, for ease of visualization we display only the top 2 findings for each category (12 in total). We discuss some of them in more detail below. Effect sizes for continuous variables are for 1 SD of the log-transformed value of the environmental factor.

Factor / cohort              N       p-value   Effect (mg/dL)
1,2,3,4,7,8-hxcdf
  2001-2002                  534     0.01      55
  2003-2004                  806     0.005     48
  combined                   1735    2e-05     30
trans-b-carotene
  2001-2002                  3605    0.002     -18
  2003-2004                  3233    0.01      -18
  2005-2006                  2889    7e-04     -19
  combined                   7374    1e-08     -16
cis-b-carotene
  2001-2002                  3605    0.01      -9
  2003-2004                  3233    0.01      -17
  2005-2006                  2596    0.001     -15
  combined                   7135    1e-06     -12
Retinol
  1999-2000                  2996    0.005     23
  2001-2002                  3610    5e-04     23
  2003-2004                  3233    2e-04     29
  2005-2006                  2889    3e-04     27
  combined                   9519    6e-21     25
Retinyl palmitate
  2001-2002                  3480    7e-04     37
  2003-2004                  3233    2e-04     62
  2005-2006                  2755    0.003     24
  combined                   8903    6e-17     41
a-tocopherol
  1999-2000                  2981    0.001     49
  2001-2002                  3609    2e-04     86
  2005-2006                  2889    2e-04     57
  combined                   7140    8e-20     67
g-tocopherol
  1999-2000                  2585    0.01      24
  2001-2002                  3579    0.002     51
  2003-2004                  3233    0.002     42
  2005-2006                  2872    7e-05     39
  combined                   9194    1e-17     41
PCB74
  1999-2000                  811     0.01      37
  2003-2004                  832     0.005     61
  combined                   2202    1e-06     38
PCB170
  2001-2002                  1004    0.01      62
  2003-2004                  825     0.002     86
  combined                   2155    4e-06     50
Oxychlordane
  1999-2000                  704     0.02      53
  2001-2002                  986     0.002     78
  2003-2004                  877     0.003     53
  combined                   2131    5e-09     57
Trans-nonachlor
  1999-2000                  814     0.02      42
  2001-2002                  1001    0.002     66
  2003-2004                  865     0.005     49
  combined                   2228    1e-08     47
Enterolactone
  2001-2002                  1149    0.02      -14
  2003-2004                  1073    0.006     -20
  combined                   2358    2e-07     -17

[Forest plot graphic. X-axis: % change, from -20 to 40.]

Figure 18. Forest plot for top 12 validated environmental factors per cohort associated with triglycerides in a model adjusting for age, age-squared, SES, ethnicity, sex, and BMI. Combined cohort denotes the estimate attained when combining all cohorts available for an exposure in a model adjusting for age, age-squared, SES, ethnicity, sex, BMI, waist circumference, T2D status, blood pressure, and cohort. Percent change (x-axis) is the percent change of lipid level for a change of 1 SD of logged exposure value. Effect size (in mg/dL) was attained when fitting the untransformed lipids to the model. Symbols are proportional to sample size and colors represent different environmental classes as represented in Figure 15.

Factor / cohort              N       p-value   Effect (mg/dL)
trans-b-carotene
  2001-2002                  3315    0.003     8
  2003-2004                  3174    0.004     6
  2005-2006                  2830    9e-04     9
  combined                   7043    2e-13     8
cis-b-carotene
  2001-2002                  3317    0.002     7
  2005-2006                  2541    0.004     7
  combined                   6809    5e-11     6
b-cryptoxanthin
  2001-2002                  3294    9e-04     7
  2003-2004                  3174    6e-04     7
  2005-2006                  2805    0.001     9
  combined                   7012    4e-13     8
Combined Lutein/zeaxanthin
  2001-2002                  3317    0.001     9
  2003-2004                  3174    2e-04     8
  2005-2006                  2830    5e-04     10
  combined                   7043    3e-15     9
trans-lycopene
  2001-2002                  3315    5e-04     10
  2003-2004                  3174    1e-04     10
  2005-2006                  2830    2e-04     14
  combined                   7043    8e-17     12
Retinyl palmitate
  2001-2002                  3200    8e-04     5
  2005-2006                  2698    0.001     8
  combined                   8425    4e-13     6
a-tocopherol
  1999-2000                  2734    0.002     14
  2001-2002                  3317    8e-05     17
  2005-2006                  2830    7e-05     17
  combined                   6665    7e-19     16
g-tocopherol
  2001-2002                  3288    0.003     8
  2003-2004                  3174    0.002     6
  2005-2006                  2814    0.005     6
  combined                   8696    3e-14     6

[Forest plot graphic. X-axis: % change, from 0 to 20.]

Figure 19. Forest plot for validated environmental factors associated with LDL-C. See Figure 18.


Factor / cohort              N       p-value   Effect (mg/dL)
Cotinine
  2003-2004                  7267    0.003     -2
  2005-2006                  6959    0.02      -1
  combined                   9513    2e-06     -1
Mercury, total
  2003-2004                  7273    0.01      1
  2005-2006                  6961    0.002     2
  combined                   6323    6e-07     2
2-fluorene
  2001-2002                  2332    0.01      -2
  2003-2004                  2192    0.006     -1
  combined                   2252    0.004     -1
3-fluorene
  2001-2002                  2332    0.02      -2
  2003-2004                  2176    0.01      -1
  combined                   2243    0.006     -1
Combined Lutein/zeaxanthin
  2001-2002                  7473    2e-04     3
  2003-2004                  6790    2e-05     3
  2005-2006                  6868    4e-04     3
  combined                   7388    2e-16     4
cis-b-carotene
  2001-2002                  7478    3e-04     2
  2003-2004                  6790    9e-04     3
  2005-2006                  6264    2e-04     3
  combined                   7151    3e-12     3
Iron, Frozen Serum
  1999-2000                  6383    0.009     2
  2001-2002                  7457    0.003     2
  2003-2004                  2706    0.006     2
  2005-2006                  2524    0.002     2
  combined                   6764    6e-11     2
Retinyl stearate
  2001-2002                  7251    0.002     -1
  2003-2004                  6790    0.003     -1
  2005-2006                  6337    0.002     -2
  combined                   8421    4e-05     -1
Folate, serum
  2001-2002                  7468    0.004     1
  2003-2004                  7267    0.02      1
  combined                   9559    2e-05     1
Vitamin C
  2003-2004                  6799    0.006     2
  2005-2006                  6911    0.02      1
  combined                   4852    0.002     1
Vitamin D
  2001-2002                  7056    0.01      1
  2003-2004                  7273    0.004     2
  2005-2006                  6966    0.01      1
  combined                   7401    1e-06     2
g-tocopherol
  2001-2002                  7428    0.001     -1
  2003-2004                  6790    0.01      -1
  combined                   9216    6e-06     -1
Heptachlor Epoxide
  2001-2002                  2022    0.01      -2
  2003-2004                  1835    0.02      -1
  combined                   2108    0.006     -2

[Forest plot graphic. X-axis: % change, from -6 to 6.]

Figure 20. Forest plot for top 12 validated environmental factors associated with HDL-C. See Figure 18.


Vitamins A and E: unfavorable association with lipid levels

For all three lipids, we found a consistent association for lipid-soluble antioxidant vitamins, such as vitamin A, vitamin E, and carotenoids (Figure 17, Figure 18, Figure 19, Figure 20). For example, a form of vitamin A, retinol, was positively associated with triglycerides (p=6 × 10⁻²¹, effect=10% or 25 mg/dL higher triglycerides per 1 SD) in all cohorts examined. Another form of vitamin A, retinyl palmitate, was also positively associated with triglycerides (p=6 × 10⁻²¹, effect=10%) and LDL-C (p=4 × 10⁻¹³, effect=5% or 6 mg/dL). Retinyl stearate was negatively associated with HDL-C (p=4 × 10⁻⁵, effect=-3% or -1 mg/dL). Retinol is the functional form of vitamin A produced in the body from β-carotene and is a co-factor in biological processes associated with vision and gene transcription [245]. Retinyl palmitate and stearate are animal- and supplement-sourced vitamin A esters stored in the liver [245].

We observed a consistent association between forms of vitamin E (α- and γ-tocopherol) and lipid levels. α-tocopherol strongly correlated with higher triglyceride and LDL-C levels (effect=35% (p=8 × 10⁻²⁰) and 16% (p=7 × 10⁻¹⁹), or 67 and 16 mg/dL, respectively). γ-tocopherol was also correlated with higher triglyceride (effect=17% higher, p=10⁻¹⁷) and LDL-C (6% higher, p=3 × 10⁻¹⁴) levels, but also with lower HDL-C (effect=-2%, p=6 × 10⁻⁶). Vitamin E is consumed via vegetables, nuts, oils, and supplements. Tocopherols are highly lipophilic and their absorption is enhanced by triglycerides.

Carotenoids: favorable association with HDL-C and triglycerides and unfavorable association with LDL-C

Both isomers of β-carotene, cis- and trans-, were associated with lower triglyceride levels (p=10⁻⁶, effect=-7% or 12 mg/dL; p=10⁻⁸, effect=-10% or 16 mg/dL, respectively). However, both isomers of carotene, in addition to other carotenoids such as β-cryptoxanthin and lycopene, were consistently associated with higher levels of both HDL-C and LDL-C. The effect was 5% (p=3 × 10⁻¹²) and 6% (p=5 × 10⁻¹¹) for HDL-C and LDL-C levels respectively for cis-β-carotene and 3% (p=10⁻¹⁰) and 12% (p=8 × 10⁻¹⁷) for lycopene. Carotenoids are primarily sourced from consumption of fruits and leafy vegetables [246]; β- and α-carotene (but not lycopene) are vitamin A precursors [245, 246].

Favorable lipid correlations with vitamins B, C, D, iron, mercury, and enterolactone

We found serum levels of folate (vitamin B), vitamin C, vitamin D, iron, and mercury to be favorably associated with HDL-C (Figure 20). Effect sizes of vitamin, iron and mercury levels on HDL-C were similar, ranging from 3 to 4% (1-2 mg/dL) higher HDL-C (p<0.002). Last, we found enterolactone, a product of lignan metabolism in the intestine, to be associated with 10% (17 mg/dL) lower triglyceride levels (p=2 × 10⁻⁷, Figure 18).

Persistent pollutants: unfavorable association with triglycerides and HDL-C

Polychlorinated biphenyls (PCBs), dibenzofurans, and organochlorine pesticides, all persistent organic pollutants, were unfavorably associated with both triglyceride and HDL-C levels (Figure 18, Figure 20). Seven PCB factors were tentatively validated, and the most significant congeners, PCB74 and PCB170, were associated with 15% (p=10⁻⁶) and 19% (p=4 × 10⁻⁶) higher triglyceride levels. Five organochlorine factors were tentatively validated, among which oxychlordane and trans-nonachlor were linked to 29% and 30% higher (p=5 × 10⁻⁹ and 1 × 10⁻⁸) triglyceride levels. Another organochlorine pesticide, heptachlor epoxide, was associated with 3% lower HDL-C (p=0.006). While use of these compounds is banned, they are known to persist and accumulate due to their stability and lipophilicity.

Markers for air pollution and nicotine: unfavorable association with HDL-C

Several markers of air pollution and nicotine exposure were unfavorably associated with HDL-C (Figure 20). The polyaromatic hydrocarbon markers of fluorene, 3-hydroxyfluorene and 2-hydroxyfluorene, were associated with 3% lower HDL-C (p=0.006 and p=0.004). Cotinine, a serum biomarker for nicotine, was also associated with 3% lower HDL-C (p=2 × 10⁻⁶). Polyaromatic compounds are formed as a result of burning of hydrocarbon-based substances, such as tobacco, coal, gas, oil, and meats.

Sensitivity analyses with further adjustments

For most questionnaire variable adjustments, we did not see a sizable difference in estimated coefficients or p-values for the environmental factors (Figure 21, Figure 22, Figure 23), including questionnaire items regarding self-reported cardiovascular-related disease status and use of drugs. Interestingly, in most exceptions, adjustments increased the effect size of the environmental factor. For example, after adjusting for vitamin and supplement intake, the associations between γ- and α-tocopherol and triglyceride and LDL-C levels became stronger. Similarly, adjustment for total fiber intake strengthened the association of β-carotenes and cryptoxanthin with LDL-C. The association of cotinine, 3-, and 2-fluorene with HDL-C strengthened after adjustment for alcohol intake. Adjusting for any fish and shellfish consumption strengthened the association between pollutants and triglycerides, as well as the association between retinyl stearate and HDL-C and triglyceride levels. Conversely, the effect of vitamin C and folate in relation to HDL-C decreased when taking supplement count, total fiber intake, and physical activity into account. Adjusting for supplement count decreased the effect of γ-tocopherol on HDL-C.

[Figure 21 graphic. X-axis: 100*(extended-original)/original, from -15 to 20. Factors shown: 1,2,3,4,7,8-hxcdf, a-Carotene, trans-b-carotene, cis-b-carotene, Retinyl palmitate, Retinyl stearate, Retinol, g-tocopherol, a-tocopherol, PCB199, PCB74, PCB99, PCB156, PCB170, PCB196 & 203, PCB206, Beta-hexachlorocyclohexane, Dieldrin, Heptachlor Epoxide, Oxychlordane, Trans-nonachlor, Enterolactone. Legend: physical_activity, count, any_fish, any_shellfish, TBCAR, TATOC, cardiovascular, TLZ.]

Figure 21. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(triglycerides). “Original” estimates were adjusted for sex, age, age-squared, SES, ethnicity, BMI, waist circumference, blood pressure, T2D status (fasting blood glucose ≥ 126 mg/dL), and cohort. “Extended” estimates were adjusted for the same co-variates as the original model, in addition to questionnaire items added sequentially. For points annotated as “cardiovascular” (red diamond), the extended estimates were adjusted for the same co-variates as the original model in addition to “count”, “physical_activity”, “any_heart_disease”, and “cholesterol_lowering” simultaneously. Estimates of βfactor that differed by more than 10% from the original estimate upon adding the extra co-variate are annotated in color. P-values for the “extended” βfactor are shown to the left of the point. Legend abbreviations: TLZ: total lutein/zeaxanthin; TATOC: total tocopherol; TBCAR: total β-carotene; any_shellfish: any shellfish consumed in past 30 days; any_fish: any fish consumed in past 30 days; count: total supplement use in past 30 days; physical_activity: total physical activity in metabolic equivalents in past 30 days.

[Figure 22: dot plot. x-axis: 100*(extended-original)/original. Rows: trans-b-carotene, cis-b-carotene, b-cryptoxanthin, combined lutein/zeaxanthin, trans-lycopene, retinyl palmitate, g-tocopherol, a-tocopherol, with p-values annotated beside the points. Legend: cardiovascular, count, TCRYP, TVC, TFIBE.]

Figure 22. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(LDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; TVC: total vitamin C; TCRYP: total cryptoxanthin; count: total supplement use in past 30 days; cardiovascular: on lipid lowering drug past 30 days or doctor said participant had heart disease.

[Figure 23: dot plot. x-axis: 100*(extended-original)/original. Rows: cotinine; mercury, total; 3-fluorene; 2-fluorene; a-carotene; trans-b-carotene; cis-b-carotene; b-cryptoxanthin; combined lutein/zeaxanthin; trans-lycopene; iron, frozen serum; retinyl stearate; folate, serum; vitamin C; vitamin D; g-tocopherol; heptachlor epoxide, with p-values annotated beside the points. Legend: TPOTA, physical_activity, count, any_shellfish, any_fish, TFF, TMAGN, TALCO, cardiovascular, TFIBE.]

Figure 23. Percent change in effect size (βfactor) between “original” and “extended” linear regression models predicting log10(HDL-C). See Figure 21 for complete caption. Legend abbreviations: TFIBE: total fiber; cardiovascular: on lipid lowering drug past 30 days or doctor said participant had heart disease; TALCO: total alcohol; TMAGN: total magnesium; TFF: total food folate; any_fish: any fish consumed in past 30 days; any_shellfish: any shellfish consumed in past 30 days; count: total supplement use in past 30 days; physical_activity: total physical activity in past 30 days; TPOTA: total potassium.

Simultaneous adjustment for self-reported cardiovascular-related disease, supplement count, lipid-lowering drugs, and physical activity strengthened the association between tocopherols and pollutant factors and triglycerides, while attenuating the association with α-carotene (Figure 21). For HDL-C levels, effects of cotinine, mercury, 3- and 2-fluorene, folate, vitamin C, vitamin D, and γ-tocopherol were all attenuated by more than 15% (Figure 23). However, the direction and significance of the effects were preserved throughout.

Correlation patterns

Evaluation of Pearson correlations showed a dense correlation pattern for triglycerides (Figure 24) and more sporadic strong correlations between various factors for LDL-C and HDL-C (Figure 25, Figure 26). As expected, we observed strong correlations among closely related factors, such as between PCBs (ρ > 0.6) or carotenoids (ρ > 0.5), and even within factor classes, such as organochlorine pesticides (ρ > 0.3). Of note, the hydrocarbon factors 2- and 3-hydroxyfluorene were highly correlated with cotinine (ρ = 0.6 and 0.7). The baseline demographics were not strongly related (e.g., ρ > 0.5) to any of the environmental factors, with the exception of age, which showed strong associations with many of them.


Figure 24. Pair-wise correlation globes for validated environmental and risk factors associated with triglycerides. Each node corresponds to a validated environmental (in color of environmental class, see Figure 15) or demographic/clinical risk factor (in white). Correlations > 0.2 and < -0.2 are shown with line thickness proportional to the absolute value of correlation. Line color corresponds to the sign of correlation (positive=red, negative=blue). To avoid overcrowding, only the most highly associated PCB and organochlorine factors are shown.
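The edge-selection rule described in the caption (draw a line when the correlation is above 0.2 or below -0.2, color by sign, thickness by magnitude) can be sketched in a few lines. This is only an illustrative sketch under stated assumptions: the function names and the toy data are hypothetical, and survey weighting and the actual plotting are not shown.

```python
from itertools import combinations
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length numeric sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def correlation_edges(data, threshold=0.2):
    """Return (factor_a, factor_b, r, color) for every pair with |r| > threshold.
    The absolute value of r would set the line thickness in the globe."""
    edges = []
    for a, b in combinations(sorted(data), 2):
        r = pearson(data[a], data[b])
        if abs(r) > threshold:
            edges.append((a, b, r, "red" if r > 0 else "blue"))
    return edges


# Hypothetical measurements, for illustration only.
toy = {
    "factor_A": [1.0, 2.0, 3.0, 4.0, 5.0],
    "factor_B": [2.0, 4.1, 5.9, 8.2, 9.8],
    "factor_C": [5.0, 4.0, 3.0, 2.0, 1.0],
    "factor_D": [1.0, -1.0, 0.0, -1.0, 1.0],
}
for a, b, r, color in correlation_edges(toy):
    print(f"{a} -- {b}: r={r:+.2f} ({color})")
```

In this toy example, factor_D is nearly uncorrelated with the other three factors and therefore contributes no edges.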


Figure 25. Pair-wise correlation globes for validated environmental and risk factors associated with LDL-C.


Figure 26. Pair-wise correlation globes for validated environmental and risk factors associated with HDL-C. To avoid overcrowding, only the most highly associated PCB and organochlorine factors are shown.

EWAS on Serum Lipids: Conclusions

In the current application, our findings reveal complex relationships between serum lipid levels and fat-soluble antioxidant vitamins. Randomized studies and meta-analyses [43, 247-250] have shown these vitamins to have no benefits or even to confer harm when given in high doses, in contrast to previous favorable associations in observational studies [251, 252]. The unfavorable lipid profile that we observed with vitamin E forms is possibly consistent with the randomized evidence on hard clinical outcomes, and we also found an unfavorable lipid profile for vitamin A forms. Carotenoids have a mixed effect, improving triglycerides and HDL-C, but worsening LDL-C.

These associations may reflect a complex web of physiological correlation or even reverse causality. For example, α-tocopherol and carotenes are transported in serum with HDL and LDL [246, 253, 254], and accurate measurement of serum α-tocopherol is dependent on serum lipids [255]. In this regard, the strong association between α-tocopherol and LDL-C and triglycerides might be considered a true positive result. On the other hand, given the lack of evidence for γ-tocopherol or retinol associating with lipoprotein complexes, their association might be due to reverse causality, such as increased antioxidant consumption among those who know about their adverse lipid profile. Nevertheless, given that vitamin E consumption has been found to increase mortality in meta-analysis [43], the large effect sizes suggest that prospective studies may be scrutinized for any potentially adverse effects of vitamin E on lipid levels and other metabolic disorders, such as T2D.

We observed an association of vitamins B (folate), C, and D, mercury, and iron with higher HDL-C levels. Folate [256] and vitamin D [257] have previously been associated with higher HDL-C. Fish, a source of cardioprotective omega-3 fatty acids, are also a large source of mercury [258]; however, we did not observe a large change in the effect size of mercury when accounting for consumption of fish. These nutrients and metals may be to some extent surrogate markers of “healthy diet” behaviors; however, what exactly constitutes a “healthy diet” is currently very difficult to define, in contrast to earlier claims [259, 260]. The strength of the association with HDL-C is similar across these dietary markers, ranging from 1-3 mg/dL for a standardized change per factor. These are small effects, and it is unclear whether cumulatively they could have a much larger impact in raising HDL-C levels, given the correlations between these markers.

We also identified enterolactone to be strongly associated with favorable triglyceride levels in this study. Enterolactone is a metabolite of lignans, which are found in foods such as flaxseed and have been associated with favorable cholesterol profiles in this form [261, 262]. Again, it is unclear what role, if any, this marker plays as a surrogate of “healthy diets”, and effects on heart disease have been inconsistent [263].

We found biomarkers of hydrocarbons, 2- and 3-hydroxyfluorene, to be strongly associated with unfavorable HDL-C levels. While others have shown the association of these metabolites with self-reported cardiovascular disease in the NHANES data [264], to our knowledge the association with HDL-C is novel. Relatedly, we also found a marker of nicotine, cotinine, to have a similar association with HDL-C. Particulate matter air pollution, composed of many types of hydrocarbons, and smoking have long been major concerns for cardiovascular-related diseases [236, 265, 266]. Smoking is well known to influence HDL-C levels [267, 268], and acute and chronic exposures to tobacco smoke have been shown to decrease HDL-C substantially [269]. The high correlation of the hydrocarbon markers with cotinine suggests that these associations might all indicate exposure to cigarette smoke.

We also found that persistent organic pollutants, such as organochlorine pesticides and polychlorinated biphenyls, were associated with large increases in triglycerides and large decreases in HDL-C. These compounds have been implicated in other metabolic-related diseases and in other populations. For example, PCB170 and heptachlor epoxide were associated with T2D in our previous EWAS on T2D. Similarly, PCBs and dibenzofurans have been associated with metabolic syndrome in a Japanese population [270] and have been claimed to have an “obesogenic” effect [271].

We should acknowledge that these associations might be confounded due to the fat solubility of these pollutants. Nevertheless, there have been efforts to characterize this relationship. For example, in a study analytically considering causal pathways and confounding bias via structural equation modeling, the investigators found a relationship between polychlorinated biphenyls and lipid levels consistent with forward causality for a native population with high exposure to these pollutants in upstate New York [272]. Another study found an ecological relationship between cardiovascular-related hospitalization rates in areas close to PCB pollution [273]. Further, higher incidence of cardiovascular disease was observed in an occupational Swedish cohort [274]. Nevertheless, and notably, these studies took place in populations in which the source of exposure was known and dosages were much higher than the general population levels seen in NHANES.

Finally, identifying specific heritable components through GWAS has proven difficult: one recent study attributed 10-12% of the variability of lipid levels to 95 genetic loci in a sample of >100,000 individuals [65], and each genetic factor explained less than 0.5% of the variance. By comparison, each of the validated environmental factors described a larger portion of the variance in lipid levels, occasionally exceeding 10%; however, unlike for the genetic variants, reverse causality cannot be excluded for these environmental factors.

DISCUSSION

By combing through all environmental exposure measures using a systematic EWAS approach, we have found multiple novel environmental factors associated with type 2 diabetes and lipid levels beyond the level of false discovery. The method is general enough to apply to diverse datasets, and, in fact, collaborators have begun utilizing EWAS to study blood pressure and kidney disease. The EWAS approach bypasses the problem of selectively testing and reporting one or a few associations at a time, which has been repeatedly debated as a source of biased results and false positives in epidemiological studies [10, 42, 179, 275, 276]. In its current form, EWAS offers a new way to generate a comprehensive list of associations that have robust support after correction for multiple comparisons, a practice not currently followed in environmental epidemiology. The ensuing associations are then carefully scrutinized for their validity in sensitivity analyses. Adjustments for potential confounders are also systematic, and correlation patterns between variables are evaluated and visualized.

Like GWAS, the EWAS framework can be used to propose targets for further study. For example, many factors are correlated; some are similar structurally, such as the isomers of β-carotene, or show dependent patterns of exposure, such as the PCBs and organochlorine pesticides or the serum markers for cotinine and urine markers for hydrocarbons. As we extend the GWAS analogy and provide a precise definition of the envirome (Chapters 1 and 3), these and other environmental factors could be said to be in “linkage disequilibrium” with each other. Just as is done for preliminary GWAS findings, EWAS findings can and should be used to identify further factors that may be in “disequilibrium,” for further detailed measurement and causal identification.

EWAS allows for comprehensive and systematic analysis of the effects of the environment in association to disease on a broad scale. While many investigators have already utilized the NHANES to address the effect of a limited number of factors on disease, these studies do not provide a global view of these associations [277, 278]. Further, the previous studies use differing definitions of disease status (such as a medical questionnaire) and exposure coding (discretization vs. log transformation), and lack methods for multiple comparison control [279-281]. It is the well-established toolkit of GWAS that has provided us with methods to overcome these limitations and to enable us to postulate about environment-wide association to disease.

Limitations of these studies remind us that measuring environment-wide aspects in relation to phenotypic states such as disease will be a difficult undertaking [10]. While the NHANES provides a large number of factors to study, a comprehensive assessment will require precise definition over a broader dimension (including more factors). While laboratory measurements are collected during a baseline fasting state for all participants in NHANES, we will still have to account for the dynamic and heterogeneous nature of different exposures and their associated responses by taking replicate measurements at different physiological states.

Furthermore, the observed cross-sectional correlations in the EWAS setting do not offer proof of causality (Chapter 3). While we attempted to check the validity of our estimates by systematically adjusting for known, self-reported confounders, residual confounding and confounding from unmeasured variables cannot be ruled out, and reverse causality always remains a possibility for findings from cross-sectional data. We have also shown how to systematically evaluate the correlation pattern between known and novel environmental correlates of lipids to communicate the complex interrelationships between these variables. We hypothesize that this approach would be helpful in designing future studies, such as randomized trials, that may try to intervene at one or more nodes in the correlation globes.

To more formally ascertain causality, we would need to perform prospective EWAS over the life course, consider incident cases, consider randomization methods [69], or even evaluate gene-environment interactions as additional validation (Chapters 3 and 5). Due to the number of hypotheses generated, we would need to integrate more evidence from large-scale collaborative studies in order to confirm (or refute) etiological aspects of these factors while being as comprehensive as possible in the observation of potential confounding variables. For example, additional factors such as behavior (food consumption, drug use, and/or exercise patterns), geographic location, and occupation must also be ascertained to account for associated risk factors and reverse causality.

The measurement of 300 environmental factors is hardly a comprehensive study of the environment, but this is still a greater number of factors than the 30 microsatellite markers [282] or 100 single nucleotide polymorphisms (SNPs) measured in some of the earliest implementations of GWAS [283]. We suggest that measurement technologies for the environment can and will improve in resolution, as novel associations are made using even a few measurements in these prototype studies. Measurement of the panel of environmental factors used here, most of which are performed by mass spectrometry, currently costs an estimated $40,000 per individual [284], or close to the current pricing for candidate SNPs and copy number variation sequencing.

Another type of hypothesis we may generate concerns the complex causes of disease. For example, we can now use an EWAS to hypothesize about “gene-environment” interactions and their relation to disease etiology. In the next chapter, we address how to screen for gene-environment interactions through integration of GWAS and EWAS, where genetic variability is assessed simultaneously along with robustly identified environmental factors. As will be seen, while resource intensive, this type of study design could perhaps facilitate an explanation of disease causation that has eluded genome-wide scans, provide additional validity for the marginal effect of exposure, and enable more accurate estimates of risk [32].

CHAPTER 5: TOWARD ENVIROME-GENOME INTERACTIONS IN THE CONTEXT OF HUMAN HEALTH: COMPREHENSIVELY SCREENING FOR GENE-ENVIRONMENT INTERACTIONS IN ASSOCIATION TO TYPE 2 DIABETES

INTRODUCTION

In previous chapters, we focused on comprehensive and agnostic methods to attain robust environmental disease associations on a population scale, known as EWAS (Chapter 3), and we applied the method to find factors robustly associated with type 2 diabetes and serum lipid levels (Chapter 4).

It is hypothesized that multiple genetic and environmental factors interact to induce complex disease. Genome-wide association studies (GWAS) have led to the discovery of many common variants associated with disease [9, 203]; however, each of these common genetic variants confers very modest risk, and cumulatively they explain only a limited portion of the disease variance [32]. It is hypothesized that some of the yet unexplained risk may be accounted for by “gene-environment interactions”, that is, the joint effect of genetics and of the environment may be different from the marginal effects of each of these two factors alone [32, 44, 47, 285, 286].

In the following, we introduce a method for screening for gene-environment interactions between prevalent environmental factors found in EWAS and common variants found in GWAS in application to T2D, overcoming a few outstanding challenges in the field. Before describing these challenges, the method, and the application, we first define and report some examples of gene-environment interactions in the context of disease.


Background

A classic example of a gene-environment interaction involves the disease phenylketonuria (PKU) [287]. Those with PKU have inherited a rare genetic variant that codes for a deficient phenylalanine hydroxylase liver enzyme and are unable to convert the amino acid phenylalanine from their diets to another amino acid, tyrosine. In the presence of both the deficient enzyme and phenylalanine, an intermediate compound, phenylketone, accumulates, leading to mental retardation. However, even with the rare genetic variant coding for the deficient protein, controlling phenylalanine exposure mitigates adverse phenotypes.

The study of gene-environment interactions is akin to “pharmacogenetics” [81, 288], which relates genetic differences to variability of molecular responses to drugs. In fact, the term “ecogenetics” (the study of gene-environment interactions [289]) discriminates environmental responses from drug responses. Over 8 decades ago, Archibald Garrod undertook initial studies regarding genetics and metabolic-focused response to environmental chemicals, observing “inborn errors of metabolism” in adverse phenotypes such as alkaptonuria [290]. In 1931, Garrod further described adverse phenotypes that occur only in certain individuals: “…substances contained in particular foods, certain drugs, and exhalations of animals or plants produce in some people effects wholly out of proportion to any which they bring about in average individuals (sic)” [291]. This observation was the first description of the classic pharmacogenetic “responder” vs. “non-responder” phenotypes that would come to dominate the field. Later, Motulsky set the stage for pharmacogenetics (and later ecogenetics) by describing the adverse response to drugs as an environmental and dose-dependent “trigger” for genetically susceptible individuals [292].

The interaction of human genetic variants and specific environmental factors can be assessed through population-based studies, in which the presence of both a genetic variant and an environmental factor is associated with a disease phenotype [289, 293]. In statistical models, the hypothesis of a joint effect is tested against the marginal association between each of the factors alone and the phenotype, as discussed below. However, as epidemiologists and toxicologists alike would note, population-based statistical interaction does not mean biological or molecular interaction [34, 294, 295]. Nevertheless, the presence of a statistical interaction enables us to hypothesize about underlying biological processes that occur between genes and environmental factors.

Most documented gene-environment interactions between genetic variants and chemical factors come in the context of genes that control metabolic processes, or “pharmacokinetic” genes, such as the direct conversion of chemical factors to products for use or excretion. Often these interaction studies occur amongst a finite set of diseases, most notably cancer. A famous example includes the product of CYP1A1, which oxidizes polyaromatic hydrocarbons. Early hypotheses surrounded the metabolic efficiency of different variants of this gene and hydrocarbons in lung cancer [296]. Molecular processes involving acetylation carried out by the N-acetyltransferase and associated proteins (NATs) have received the most attention in relation to variable host responses. For example, altered function due to NAT variants and exposure to cigarette smoke in the context of colon and bladder cancer has been well studied [297, 298], gathering robust evidence in GWAS [45].

Evidence for interactions of specific factors has been less strong for T2D. First, the environment is often attributed to abstract factors such as “lifestyle” or a proxy for a collection of environmental influences, such as body mass index [299-302], but there are exceptions, such as the interaction hypothesized between variants in PPARG (Pro12Ala) and dietary fat intake [303]. More surprisingly, there are few examples of interaction between the strongest hits from GWAS for T2D and environmental factors, such as between rs7903146 (TCF7L2) and dietary carbohydrate, although the strength of association was weak to moderate [304]. What is needed is a method to screen a space of possible interactions to prioritize further study.

Screening for Gene-Environment Interactions: “G-EWAS”

Despite the hypothesis that gene-environment interactions play a role in diseases of multifactorial nature such as T2D, there is an absence of documented evidence for specific gene-environment interactions for the disease. Investigating gene-environment interactions is a challenging undertaking. First, analyzing gene-environment interactions is a complex and power-intensive exercise [47]. Second, most population-based epidemiological studies traditionally examine either only genetic risk factors or only environmental risk factors; a smaller set of studies captures information comprehensively on both genetics and the environment. It is quite rare for significant numbers of genotypes and environmental factors to be measured concurrently. Beyond these, there is another practical challenge that we propose to address using the comprehensive analytic techniques introduced in Chapter 3: the selection of which candidate factors to measure in the first place.

The outstanding practical challenge we address here revolves around the domain of factors: which of the millions of genetic variants or thousands of environmental factors do we choose to measure jointly? Often, genetic variants and environmental factors are selected by convenience, without sufficient documentation of the strength of their marginal associations. Given the complexity of gene-environment interaction analyses [47], it is possible that there is a problem with selective analyses and selective reporting of only some of the results from each study in a fragmented and possibly biased fashion [42, 305]. Many studies do not account for the multiplicity of all the interaction effects that they have explored. There is a need to select common variants and exposures resulting from comprehensive studies, and in turn, systematically screen their interactions to avoid the spurious results seen in many candidate-driven investigations [299, 301, 306].

Instead of the traditional candidate locus and environmental factor approach, one new way forward would be to screen a set of gene and environmental factors and use the “best hits” as candidates for further study and validation. To construct such a screen, we propose utilizing factors arising from comprehensive and systematic studies that have resulted in robust and replicated associations with disease of interest. For example, much has been written about using genome-wide association studies (GWAS) to find common genetic variants associated with complex disease [203]. We recently published an analogous approach for finding associated environmental factors, called an environment-wide association study (EWAS) [35].

We propose here a systematic approach to select and test gene-environment interactions in association to a common disease such as T2D, specifically testing the interaction between robust factors found in GWAS and EWAS. We are able to conduct this study because of the specific nature of the Centers for Disease Control (CDC) National Health and Nutrition Examination Survey (NHANES) [37], which we introduced earlier in Chapter 4. To recap, the survey includes 261 genotyped loci, more than 50 environmental factors measured in blood and urine (e.g. nutrients, vitamins, and pollutants), and clinical biomarkers (such as fasting blood glucose) for the same individuals. Focusing on the top GWAS and EWAS hits on T2D, we systematically investigate variant-environment interactions in association with the disease on these cohorts, creating hypotheses for further investigation.

METHODS

Figure 27 schematically shows the systematic approach for testing gene-environment interactions. The analysis of interactions is conducted in a dataset that has available measurements for both genetic variants and environmental factors. We select genetic variants and environmental factors that have strong evidence of association for their marginal effects.

For genetic associations, the current paradigm of GWAS has provided the framework to assemble robustly replicated sets of common genetic variants with previously documented genome-wide significance (p < 5×10⁻⁸ [8]). For environmental associations, we conducted EWAS to comprehensively search for and validate prevalent environmental factors in association to T2D (Chapter 4). For environmental variables, there is weaker consensus on what constitutes sufficiently robust standards of replication [307], and it should be acknowledged that, in contrast to genetics, reverse causality cannot be easily excluded. Here we selected environmental exposures that have shown significant associations in at least 2 (and up to 4) independent cohorts after accounting for the multiplicity of analyses and after adjusting for demographic factors.

First, we examined the marginal effects of each of the “G” number of genetic variants or “E” number of environmental factors on T2D separately. Second, we computed the association between each environmental factor and variant pair (total of E x G tests) to ascertain the degree of their dependence. In our main screen, each environmental factor and variant pair (total of E x G tests) is tested for interaction while adjusting for other known risk factors (Figure 27B). Finally, multiplicity of analyses is accounted for with both Bonferroni-adjusted p-values and false discovery rate (FDR) estimation (Figure 27C).
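The bookkeeping behind these steps can be sketched as below: G + E marginal tests, followed by E x G dependence tests and E x G interaction tests. This is an illustrative sketch only; the function name and the placeholder identifiers are hypothetical, and the counts of 18 variants and 5 factors are taken from the data description later in this chapter.

```python
from itertools import product


def plan_screen(variants, factors):
    """List the tests implied by the analysis steps for G variants and E factors."""
    # Step 1: one marginal test per variant and per environmental factor.
    marginal = [("marginal", name) for name in list(variants) + list(factors)]
    # Steps 2 and 3: one test per (variant, factor) pair, i.e. E x G tests each.
    pairs = list(product(variants, factors))
    dependence = [("dependence", g, e) for g, e in pairs]
    interaction = [("interaction", g, e) for g, e in pairs]
    return marginal, dependence, interaction


# Placeholder names (hypothetical); only the counts follow the text.
variants = [f"variant_{i}" for i in range(1, 19)]  # G = 18
factors = [f"factor_{j}" for j in range(1, 6)]     # E = 5
marginal, dependence, interaction = plan_screen(variants, factors)
print(len(marginal), len(dependence), len(interaction))
```

The multiplicity adjustment in the final step is applied to the interaction tests of the main screen.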

[Figure 27: schematic with four panels. A, Factor Selection: environmental factors (cis-β-carotene, trans-β-carotene, γ-tocopherol, heptachlor epoxide, PCB199) and 18 genetic variants: rs10923931 (NOTCH2), rs7903146 (TCF7L2), rs13266634 (SLC30A8), rs7901695 (TCF7L2), rs2383208 (CDKN2A), rs1260326 (GCKR), rs780094 (GCKR), rs2237895 (KCNQ1), rs10811661 (unknown), rs4712523 (CDKAL1), rs4607103 (ADAMTS9), rs1111875 (HHEX), rs7578597 (THADA), rs4402960 (IGFBP2), rs1801282 (PPARG), rs12779790 (CAMK1D), rs8050136 (FTO), rs864745 (JAZF1). B, Test for Interaction: logit(FBG ≥ 126 mg/dL) against standardized γ-tocopherol, by number of risk alleles (0, 1, 2) of rs13266634. C, Multiplicity Adjustment: Bonferroni correction using the effective number of tests; FDR estimated using parametric bootstrapping. D, Sensitivity Analyses: Caucasian sub-group; > Age 40 sub-group.]

Figure 27. Schematic for comprehensive testing and screening for gene-environment interactions against T2D. A.) Genetic and environmental factors are chosen by their strength of marginal association in GWAS and EWAS. B.) Each genetic variant and exposure pair is tested for interaction in association to disease (example shown: γ-tocopherol and the variant rs13266634 in association to T2D (fasting blood glucose [FBG] > 125 mg/dL)) in a logistic regression model adjusting for other risk factors and main effects of exposure and variant. C.) Multiple hypotheses are controlled for using a modified Bonferroni correction and the FDR is estimated. The correction factor for the multiplicity adjustment is the number of estimated independent tests conducted, and the empirical false positive rate for FDR estimation is estimated through a parametric bootstrap approach. D.) Sensitivity analyses are conducted, restricting the samples analyzed to a Caucasian-only subgroup and an over-age-40 subgroup.

Data and selected genetic and environmental factors

We used data from the National Health and Nutrition Examination Survey (NHANES) described in Chapter 4 [37]. On the genetics side, we considered 18 genetic variants that have been previously documented through GWAS to have robust association (reaching genome-wide significance) with T2D. These variants have been assayed among consenting individuals in two NHANES surveys, those conducted in 1999-2000 and 2001-2002. A total of 8000 individuals from these cohorts had both consented to the use of their DNA for research and had blood samples available for genetic testing. Genetic variants were chosen a priori by different groups of independent researchers investigating other research topics. We computed allele frequencies of each variant stratified on self-reported race to confirm their presence. In NHANES, race was coded in 5 groups (Mexican American, Non-Hispanic Black, Non-Hispanic White, Other Hispanic, Other).

On the environment side, we previously identified and tentatively validated 5 environmental factors associated with T2D, including trans-β-carotene, cis-β-carotene, γ-tocopherol, heptachlor epoxide, and PCB170 after systematically screening 266 environmental factors measured by blood or urine tests (Chapter 4) [10]. To recap, the false-discovery rate for each of these 5 associations was less than 10% in at least 2 independent cohorts and the overall FDR, assessed by considering all combinations of attaining significance in more than 1 cohort, was less than 1% for all 5 factors.

T2D cases are defined as individuals who had 8.5-hour fasting blood glucose (FBG) greater than or equal to 126 mg/dL as advised by the American Diabetes Association (ADA), similar to our EWAS on T2D (Chapter 4). To increase our power for detection of interaction effects, we combined data from the two cohorts. Depending on the genetic and environmental variables tested for interaction, there were a total of 921 to 2924 controls and 82 to 278 cases.

Each genetic variable was coded for the number of risk alleles as designated by the papers from which they were found [23, 308]. Environmental factors were log-transformed and standardized (expressed in standard deviation units).
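The variable coding described above can be sketched as follows. This is a minimal, unweighted illustration: the function names, genotype strings, and concentrations are hypothetical, the use of the sample standard deviation is an assumption, and the log base 10 follows the regression description in this chapter.

```python
from math import log10
from statistics import mean, stdev


def standardize_log10(values):
    """log10-transform positive measurements, then express them in SD units.
    Uses the unweighted sample SD (an assumption; survey weights are ignored)."""
    logged = [log10(v) for v in values]
    mu, sd = mean(logged), stdev(logged)
    return [(x - mu) / sd for x in logged]


def count_risk_alleles(genotype: str, risk_allele: str) -> int:
    """Number of copies (0, 1, or 2) of the risk allele in a two-letter genotype."""
    return genotype.count(risk_allele)


# Hypothetical serum concentrations and genotypes, for illustration only.
z = standardize_log10([1.0, 10.0, 100.0])
print([round(v, 2) for v in z])
print([count_risk_alleles(g, "T") for g in ("CC", "CT", "TT")])
```

With this coding, a regression coefficient for an environmental factor is interpreted per 1 SD change in the log-transformed exposure, and a coefficient for a variant is interpreted per additional risk allele.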

Regression Analyses

We assessed the marginal effect of each genetic variant or environmental factor on T2D with survey-weighted logistic regression, adjusting for self-reported race, age, sex, and BMI. Next, we ascertained whether genetic variants might be correlated with levels of environmental factors. We evaluated the correlation between genetic and environmental factors through survey-weighted linear regression, regressing log base 10 of the environmental exposure variable on each genetic variable, adjusted for self-reported race, sex, age, and BMI. We used 4-year survey weights corresponding to the smallest subsample analyzed, as advised by the National Center for Health Statistics (NCHS) [207].

Next, we conducted our systematic interaction screen (Figure 27A-B). Specifically, we screened the space of possible pairs of interactions, totaling 90 (18 genetic loci times 5 environmental factors). We utilized survey-weighted logistic regression to associate each pair of factors to T2D, incorporating a multiplicative interaction term and the main effects of both factors. Each model was further adjusted for age, sex, self-reported race, and BMI. As above, we used 4-year survey weights corresponding to the smallest subsample analyzed, as advised by the NCHS [207].
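For reference, the interaction model described above can be written as follows. The notation is an assumption introduced here for illustration: $G_i$ is the number of risk alleles for participant $i$, $E_i$ is the standardized log-transformed exposure, $\mathbf{z}_i$ collects the adjustment covariates (age, sex, self-reported race, BMI), and survey weighting is omitted from the expression.

```latex
\operatorname{logit} P(\mathrm{T2D}_i = 1)
  = \beta_0 + \beta_G G_i + \beta_E E_i
  + \beta_{G \times E}\, G_i E_i
  + \boldsymbol{\gamma}^{\top} \mathbf{z}_i
```

Under this form, the screen tests $\beta_{G \times E} = 0$ for each of the 90 variant-factor pairs.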

Multiplicity Correction and FDR

We corrected for multiple hypotheses through direct Bonferroni correction of the statistical significance level and through FDR estimation (Figure 27C). Bonferroni multiplicity correction divides the threshold for statistical significance (for example, p=0.05) by the number of statistical tests conducted. Since our tests are not independent, we estimated the total number of “effective” genetic loci and environmental exposures tested jointly by taking into account the correlation between the selected factors. This approach, which more accurately estimates the number of hypotheses for a group of correlated factors, has been applied previously to the study of genetic variants [309]. Here, we extended the method to environmental factors. For the 18 genetic loci, we calculated the correlation between the genetic factors stratified by ethnicity and concluded that there were 17.7 effective genetic factors. For the 5 environmental factors, we calculated 4.41 effective factors. Thus, the total number of effective tests was 78.1 (17.7 x 4.41). The adjusted level of significance for a single-test threshold of p=0.05 is therefore 0.0006 (0.05/78.1).
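The arithmetic behind the adjusted threshold can be reproduced with a few lines. This sketch takes the effective numbers of factors reported in the text as given; it does not implement the correlation-based estimation of effective tests from [309], and the helper `bonferroni_adjust` is a name introduced here for illustration.

```python
effective_genetic = 17.7        # effective number of genetic loci (from the text)
effective_environmental = 4.41  # effective number of environmental factors (from the text)
alpha = 0.05                    # single-test significance threshold

effective_tests = effective_genetic * effective_environmental
adjusted_alpha = alpha / effective_tests

def bonferroni_adjust(p_value, n_tests):
    """Bonferroni-adjusted p-value, capped at 1 (illustrative helper)."""
    return min(1.0, p_value * n_tests)

print(round(effective_tests, 1))  # 78.1
print(round(adjusted_alpha, 4))   # 0.0006
```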

We also calculated the FDR, the estimated proportion of false positives among the total of significant hypotheses for a given significance level [238]. To estimate the number of false positives, we empirically generated a distribution of null test statistics corresponding to the interaction term while preserving the main effects of the variant and exposure terms using a parametric bootstrap method [310].

Briefly, a parametric bootstrap approach samples responses, with replacement, from a model representing the “null” hypothesis many times to enable the creation of a null distribution of test statistics. To create the null distribution of test statistics corresponding to the interaction term (β GxE), we fit a “null” logistic regression model omitting the interaction term (β GxE = 0) while leaving in the model the parameters for the main effects of the environmental factor, the genetic variant, and the remaining covariates (age, sex, race, and BMI). We bootstrapped the responses from this null model and refit the original model described above, adding the covariate corresponding to the interaction between variant and environmental factor. To estimate a null distribution, we repeated this procedure 100 times. Finally, the FDR was estimated as the ratio of the number of interaction results called significant in the null distribution to the number of all results called significant in the observed data (both true and false positives) at a given significance level.
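The two ingredients of this procedure that do not depend on the regression software can be sketched as below. This is an illustration under stated assumptions, not the implementation used here: the model refitting step is left out, the function names are hypothetical, the number of null hits is averaged over bootstrap replicates, and the FDR is reported as 0 when nothing is called significant in the observed data.

```python
import random

def simulate_null_responses(null_probabilities, rng):
    """Draw one bootstrap set of 0/1 outcomes from fitted null-model probabilities."""
    return [1 if rng.random() < p else 0 for p in null_probabilities]

def empirical_fdr(observed_pvalues, null_pvalues_by_replicate, alpha):
    """Mean number of null results with p <= alpha, divided by the number of
    observed results with p <= alpha (capped at 1)."""
    observed_hits = sum(p <= alpha for p in observed_pvalues)
    if observed_hits == 0:
        return 0.0  # convention chosen for this sketch
    null_hits = [sum(p <= alpha for p in rep) for rep in null_pvalues_by_replicate]
    expected_false = sum(null_hits) / len(null_hits)
    return min(1.0, expected_false / observed_hits)

# Toy example with made-up p-values: 4 interaction tests, 2 bootstrap replicates.
observed = [0.001, 0.02, 0.3, 0.8]
null_replicates = [[0.04, 0.5, 0.6, 0.9], [0.2, 0.3, 0.7, 0.95]]
fdr = empirical_fdr(observed, null_replicates, alpha=0.05)
```

In the toy example two observed tests fall below 0.05 while the null replicates contribute 0.5 such tests on average, giving an estimated FDR of 0.25.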

All presented analyses include data from diverse ancestry and age groups, as reflected in the US population that is sampled by NHANES. Given that the stronger evidence for the specific T2D associations has been procured by studies in individuals of Caucasian descent, we also performed a sensitivity analysis limiting the data to participants who were coded as non-Hispanic white and other Hispanic (Figure 27D). Next, as NHANES is cross-sectional and presumably many participants at genetic risk might not be diagnosed at the time of sampling, we performed an analysis using an older set of individuals (greater than 40 years of age).

BMI is a notable risk factor for T2D [311, 312]. As a means of comparison, we also sought to document any interaction between the 18 genetic variants and BMI. To conduct these interaction tests, we standardized BMI by centering the measurements about the mean and dividing by the standard deviation. As above, we fit a survey-weighted logistic regression model, modeling diabetes status as a function of the main effect of the variant (coded as number of risk alleles), the main effect of BMI, and their interaction term, in addition to sex, age, and self-reported race. We estimated greater than 90% power to detect interactions between BMI and these genetic loci for an interaction OR of 1.5 at the 0.05 level of significance [313].

For all analyses, we used SAS version 9.2 accessed through a Remote Data Center (RDC) located in Hyattsville, Maryland. As NHANES is a complex, multi-stage, stratified survey, we utilized survey sampling units, strata, and weights for all analyses as in the previous chapter [207].

RESULTS

We implemented a systematic screen for detecting interaction effects between 18 genetic variants identified in GWAS on T2D and 5 environmental factors identified in EWAS on T2D (Chapter 4) [35], a total of 90 interaction tests (Figure 27A). We modeled T2D using logistic regression with a multiplicative interaction term between each genetic variant and environmental factor pair while adjusting for age, sex, self-reported race, and body mass index (BMI) (Figure 27B). We assessed multiple hypotheses using a Bonferroni and a false discovery rate (FDR) approach (Figure 27C) and conducted sensitivity analyses to assess the robustness of our results (Figure 27D). We begin by assessing power, marginal associations, and correlation between genetic variants and environmental factors.

Allele frequencies

We estimated the minor allele frequencies of each of the 18 variants in our two US NHANES cohorts. For most of the loci, minor allele frequencies were greater than 5% for all of the self-reported ethnicities. The only exception was rs1801282, which showed a 3% minor allele frequency in self-reported Blacks. This suggests that all the surveyed ethnicities had a reasonable frequency of minor alleles at these loci for study.
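A stratified allele frequency of this kind can be computed from 0/1/2 genotype codes as sketched below. The sketch is unweighted (the survey design is ignored), and the group labels, genotype values, and function names are hypothetical.

```python
from collections import defaultdict

def allele_frequency_by_group(allele_counts, groups):
    """Frequency of the counted allele within each group, for 0/1/2 genotype coding."""
    totals = defaultdict(int)
    chromosomes = defaultdict(int)
    for count, group in zip(allele_counts, groups):
        totals[group] += count
        chromosomes[group] += 2  # each diploid participant carries two alleles
    return {g: totals[g] / chromosomes[g] for g in totals}

def minor_allele_frequency(freq):
    """The minor allele is whichever of the two alleles is less frequent."""
    return min(freq, 1.0 - freq)

# Hypothetical genotypes for four participants in two groups.
freqs = allele_frequency_by_group([0, 1, 2, 1], ["A", "A", "B", "B"])
mafs = {g: minor_allele_frequency(f) for g, f in freqs.items()}
```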

Power Calculations

In order to study the interactions between concurrently measured genotypes and environmental factors in our two US NHANES cohorts, we determined whether we had sufficient power to proceed. Power computations for genotype-environment interactions depend on the minor allele frequency of the loci, environmental factor variability, the ratio of cases to controls, and marginal associations to disease [313]. We used the minor allele frequencies estimated from our cohorts (5-44%) and the exact ratio of cases to controls available for each genotype-environment factor pair, assumed standardized environmental variables (SD=1), and assumed marginal ORs of 1.2 and 2.0 for the genetic loci and environmental factors respectively (gathered from previous GWAS and EWAS). Under these assumptions, we estimated that we had 30-96% (median=71%) and 63-99% (median=98%) power with 82 to 278 cases (921 to 2924 controls) to detect an interaction OR of 1.5 and 2.0 respectively at a significance threshold of α=0.05 [313] (Figure 28).

[Figure 28 plot. Panel title: "T2D Power: p=0.05, OR=1.5"; y-axis: power; x-axis: genetic locus (18 variants); legend: PCB170, trans-β-carotene, cis-β-carotene, γ-tocopherol, Heptachlor Epoxide.]

Figure 28. Power estimation for detection of interaction for each genetic locus and environmental factor pair tested against T2D (FBG > 125 mg/dL) [313]. Assumptions include an interaction odds ratio of 1.5, a main effect of genetic locus of 1.2 and of environmental factor of 2.0 estimated from previous studies, minor allele frequencies as in Supplementary Table 1, approximately 10 controls per case, environmental factor SD of 1, and p-value of 0.05. Markers alternate between filled and open for each locus.

Marginal Associations

Three of 18 genetic variants were marginally associated with T2D in the NHANES cohorts at a significance level of 0.05 (uncorrected for multiple hypotheses here, given that they have been previously robustly documented to be associated with T2D) after adjustment for age, sex, race, and BMI. These included loci rs10923931 (NOTCH2), rs7903146 (TCF7L2), and rs13266634 (SLC30A8) (Table 8). In addition to these genetic variants, we also list here the five environmental factors we previously found strongly associated with T2D in the cohorts examined here (Table 8) [35].


Locus (gene) or Environmental Factor   N (cases)     p-value      OR (95% CI)
rs10923931(NOTCH2)                     3429 (297)    0.0043       1.50 (1.14,1.98)
rs7903146(TCF7L2)                      3401 (296)    0.015        1.32 (1.06,1.65)
rs13266634(SLC30A8)                    3427 (298)    0.018        1.33 (1.05,1.69)
rs1260326(GCKR)                        3408 (296)    0.13         1.27 (0.93,1.73)
rs7901695(TCF7L2)                      3402 (293)    0.13         1.22 (0.94,1.57)
rs780094(GCKR)                         3430 (298)    0.15         1.25 (0.92,1.70)
rs4607103(Unknown)                     3421 (298)    0.41         0.89 (0.68,1.17)
rs2383208(Unknown)                     3122 (298)    0.42         0.86 (0.60,1.24)
rs4402960(IGF2BP2)                     3406 (296)    0.52         0.93 (0.74,1.17)
rs7578597(THADA)                       3416 (291)    0.52         0.88 (0.59,1.30)
rs12779790(Unknown)                    3415 (293)    0.58         0.91 (0.66,1.27)
rs2237895(KCNQ1)                       3415 (296)    0.61         0.95 (0.77,1.17)
rs1801282(PPARG)                       3405 (296)    0.62         0.90 (0.58,1.38)
rs4712523(CDKAL1)                      3431 (298)    0.63         1.07 (0.81,1.42)
rs8050136(FTO)                         3403 (295)    0.74         0.96 (0.75,1.23)
rs1111875(Unknown)                     3407 (296)    0.75         1.04 (0.83,1.30)
rs864745(JAZF1)                        3430 (298)    0.8          0.96 (0.70,1.31)
rs10811661(Unknown)                    3406 (296)    0.84         0.96 (0.62,1.48)

trans-β-carotene                       3033 (189)    5 x 10^-5    0.64 (0.52,0.79)*
γ-tocopherol                           5349 (314)    5 x 10^-5    1.46 (1.25,1.72)*
cis-β-carotene                         3032 (189)    2 x 10^-4    0.63 (0.50,0.81)*
PCB170                                 1807 (98)     0.005        1.72 (1.18,2.52)*
Heptachlor Epoxide                     1711 (94)     0.005        1.49 (1.13,1.98)*

Table 8. Marginal association of each locus (n=18) or environmental factor (n=5) to T2D (FBG > 125 mg/dL). Per-risk-allele ORs are shown, adjusted by sex, age, ethnicity, and BMI. *Per-1-SD ORs are shown, adjusted by sex, age, ethnicity, and BMI.

Correlation between genetic variants and environmental variables

We found little evidence for correlation between the 18 genetic variants and the 5 environmental factors. Nominal relationships included a negative association between rs10923931 and Heptachlor Epoxide (p=0.02), where levels of Heptachlor Epoxide decreased 10% per risk allele. We also observed a negative association between rs10923931 and cis-β-carotene (p=0.04), where levels of cis-β-carotene decreased 5% per risk allele.

Screening for Genetic Variant by Environment Interactions

We then proceeded to study interactions between the 18 genetic variants and the 5 environmental factors, a total of 90 interactions tested using survey-weighted logistic regression adjusted for age, sex, self-reported race, and BMI. Figure 29 presents a Manhattan-style plot in which all 90 interaction terms are plotted with their p-values. Of these 90, we found 8 results at p < 0.05, with false discovery rates between 1.5 and 37%, involving six genetic variants and four environmental factors. Further, of these 90, we found 4 results with FDR less than 20% (p < 0.01), involving 2 variants and 3 environmental factors, that are worth pursuing in further study.

[Figure 29 plot. y-axis: -log10(p-value of interaction term); x-axis: genetic variant (the 18 loci listed in Table 8); legend: PCB170, trans-β-carotene, cis-β-carotene, γ-tocopherol, Heptachlor Epoxide; FDR annotations on the top 8 points: 0.015, 0.06, 0.16, 0.18, 0.22, 0.24, 0.24, 0.37.]

Figure 29. Manhattan plot of significance values of the interaction term (-log10(p-value) for the interaction term of each pair of factors). The x-axis is grouped by variant (n=18); within each group are 5 points corresponding to the environmental factors tested in interaction with that variant. The top 8 pairs (p-value ≤ 0.05) are annotated with their false discovery rate. For example, the interaction between rs13266634 (in the SLC30A8 gene) and γ-tocopherol is annotated and has an FDR of 18%. The Bonferroni threshold is shown as the dotted line. Markers alternate between filled and open for each locus.

The top four of eight findings are discussed here. Our top result was the interaction between the nutrient marker trans-β-carotene and the non-synonymous SNP rs13266634 (SLC30A8), which was significant beyond the Bonferroni-adjusted cutoff significance level (interaction p = 5 x 10^-5, Bonferroni-adjusted p-value 0.006, FDR=1.5%). At lower levels of trans-β-carotene, defined as a point value 1 SD below the mean level of the factor, the per-allele effect size (odds ratio, OR) was 1.8 (95% CI: 1.3, 2.6), 40% greater than the marginal effect (Figure 30). The adjusted OR per change in trans-β-carotene levels was protective for those who had 2 risk alleles for the variant (adjusted OR 0.5, 95% CI: 0.4, 0.8), while the effect was negligible for those with 0 or 1 risk alleles. We observed similar effects for cis-β-carotene and rs13266634 (Figure 30).

On the other hand, we observed an opposite effect for individuals who carried the rs13266634 risk alleles with rising levels of γ-tocopherol (interaction p=0.0095, FDR=18%). The per-allele adjusted OR for individuals with γ-tocopherol levels 1 SD higher than the mean was 1.6 (adjusted 95% CI: 1.3, 2.1), a 25% increase when compared to the marginal effect (Figure 30). For individuals with below-mean levels of γ-tocopherol, genetic risk appears mitigated.

While we did not detect a marginal association between the intergenic SNP rs12779790 and T2D, we observed an interaction between this locus and trans-β-carotene (p < 0.01, FDR = 16%) in association with T2D (Figure 30). Specifically, the protective effect of trans-β-carotene increased 50% for those with 2 risk alleles, with an adjusted per-SD environmental factor OR of 0.3 (95% CI: 0.2, 0.5) compared to 0.6 for the marginal per-SD effect of the factor.

Interestingly, our weakest result (not in the top 4) was an interaction between rs7903146 (TCF7L2), the most highly replicated T2D GWAS variant in Caucasian populations as observed in the NHGRI catalog [203], and trans-β-carotene (interaction p=0.04, FDR=37%). While the result may be spurious, we observed that those with low levels of trans-β-carotene have a roughly 8% higher per-allele OR (1.4, 95% CI: 1.1, 1.9) compared to the significant marginal effect (Figure 30).

[Figure 30 plot data. Per-risk-allele OR [95% CI]; x-axis: Per risk allele OR (0 to 3).]

Variant (gene)        Factor            Marginal         Low (-1 SD)       Mean             High (+1 SD)      p-value (FDR)
rs13266634(SLC30A8)   trans-β-carotene  1.3 [1.1,1.7]    1.8 [1.3,2.6]     1.1 [0.8,1.5]    0.67 [0.41,1.1]   7.8e-05 (0.015)
rs13266634(SLC30A8)   cis-β-carotene    1.3 [1.1,1.7]    1.8 [1.3,2.5]     1.2 [0.85,1.6]   0.76 [0.47,1.2]   0.0015 (0.06)
rs12779790(Unknown)   cis-β-carotene    0.91 [0.66,1.3]  1.2 [0.72,1.9]    0.78 [0.55,1.1]  0.52 [0.34,0.8]   0.0062 (0.16)
rs13266634(SLC30A8)   γ-tocopherol      1.3 [1.1,1.7]    0.84 [0.52,1.3]   1.2 [0.88,1.5]   1.6 [1.3,2.1]     0.0095 (0.18)
rs4402960(IGF2BP2)    trans-β-carotene  0.93 [0.74,1.2]  0.82 [0.58,1.2]   1.1 [0.83,1.4]   1.4 [1,1.8]       0.014 (0.22)
rs4712523(CDKAL1)     trans-β-carotene  1.1 [0.81,1.4]   1.4 [0.86,2.4]    1.1 [0.76,1.6]   0.85 [0.61,1.2]   0.021 (0.24)
rs2237895(KCNQ1)      PCB170            0.95 [0.77,1.2]  0.44 [0.21,0.93]  0.61 [0.34,1.1]  0.85 [0.5,1.5]    0.023 (0.24)
rs7903146(TCF7L2)     trans-β-carotene  1.3 [1.1,1.6]    1.4 [1.1,1.9]     1.1 [0.88,1.5]   0.88 [0.58,1.4]   0.043 (0.37)

Figure 30. Per-risk-allele effect sizes for top putative interactions with p < 0.05. Black markers denote the OR for the marginal effect of the variant; the red markers denote the interaction OR computed at low (1 SD lower than the mean), mean, or high (1 SD greater than the mean) levels of exposure respectively. Marker sizes are proportional to inverse variance.
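The exposure-specific ORs in Figure 30 follow from a logistic model with a multiplicative interaction term: with a standardized exposure, the per-allele log odds ratio at exposure level z is the variant coefficient plus z times the interaction coefficient. The sketch below illustrates this relationship. The two coefficients are illustrative values back-calculated from the rounded ORs reported for rs13266634 and trans-β-carotene, not the fitted estimates, and the function name is hypothetical.

```python
import math

def per_allele_or(beta_g, beta_gxe, exposure_z):
    """Per-risk-allele odds ratio at a standardized exposure level (in SD units)."""
    return math.exp(beta_g + beta_gxe * exposure_z)

beta_g = math.log(1.1)  # illustrative: per-allele log OR at the mean exposure
beta_gxe = -0.49        # illustrative: interaction coefficient on the log-odds scale

for label, z in [("low (-1 SD)", -1.0), ("mean", 0.0), ("high (+1 SD)", 1.0)]:
    print(label, round(per_allele_or(beta_g, beta_gxe, z), 2))
```

With these illustrative inputs the three values come out close to the 1.8, 1.1, and 0.67 reported for this pair.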

Sensitivity Analyses limited to non-Hispanic white and other Hispanic participants and older individuals

In sensitivity analyses limited to only Caucasian participants (self-reported non-Hispanic white and Hispanics, 55 to 58% of the population in the originally analyzed NHANES cohorts), we were able to reconfirm the top 4 interactions (FDR < 20%) in both their effect size and their strength of association. Specifically, there was less than a 10% change in interaction effect sizes between the Caucasian-only analysis and the full cohort analyzed. Further, the p-values of association remained less than 0.05 despite decreased power. We concluded that we had limited power to claim that the 5th- to 8th-ranked interactions were preserved in this sub-sample (p > 0.05); however, there was a negligible change in their effect sizes.

As NHANES is cross-sectional, there remains a possibility that individuals who are at genetic risk for T2D might not be diagnosed as such at the time of their sampling. To estimate the effect of this bias, we estimated the size of interaction effects for a sub-sample aged 40 and older (63-64% of the population of the originally analyzed NHANES cohorts). When analyzing the sub-sample of older individuals, there was negligible change in interaction effects for all of the top 4 factors (FDR < 20%) and their statistical significance level remained less than 0.05.

Limited Evidence to Support Interactions with Body Mass Index

We sought to compare interaction effects between BMI, a notable risk factor for T2D, and the 18 top genetic factors tested in this pilot study. Interestingly, while adequately powered, we were unable to uncover substantial interaction effects that would survive Bonferroni correction. We did observe a modest interaction between BMI and rs8050136 of the FTO gene (uncorrected interaction p-value=0.03). rs8050136 is known to be an obesity-related locus whose association with T2D is explained primarily through its effect on BMI [228].

DISCUSSION

We have shown here how results from two comprehensive association approaches on genetics and the environment, GWAS and EWAS, can be combined to systematically screen for gene-environment interactions. We implemented ways to correct for multiple hypotheses using a modified Bonferroni adjustment method and through estimation of the FDR. In particular, we have implemented a method to estimate the FDR empirically using the parametric bootstrap [310], a less conservative way to mitigate the cost of multiple hypotheses. We propose that the most promising hypotheses that emerge from this systematic process are candidates for replication in additional independent cohorts in prospective studies.

We restricted our analyses to environmental factors and genetic variants on the basis of the strength of the evidence on their marginal associations in GWAS and EWAS. One could also consider evaluating gene-environment interactions for genetic loci or environmental factors that do not have robust support for the presence of marginal associations. Given the small marginal effects for most common genetic variants, many genuine associations do not reach genome-wide significance and remain false negatives [41]. Some of those may have strong interactions with the environment [301], and may only be discovered if the appropriate joint environmental variables are considered. However, selecting which of the millions of non-genome-wide-significant SNPs to test is challenging. It is well known that testing for interactions is power-intensive [44]; furthermore, testing a large number or all of them imposes an even greater power and multiplicity burden [47]. For environmental factors, the choice of which ones to test for interaction is even more tenuous. Notably, in contrast to common genetic variants, there is as yet no high-throughput measurement platform that captures all the environmental factors, and this lack of measuring capacity limits data availability. Measurement error can be substantial for many environmental exposures [10, 314].


We had the ability to screen 266 environmental factors measured in serum and urine systematically through a prior EWAS process for association with T2D. We selected for interaction testing only the 5 that had the strongest support, as judged by FDR, persistence of effects after adjustment for confounders, and replication in independent cohorts. The proposed approach creates a systematic list of tested interaction terms while keeping the number of tested terms manageable. However, it is still very important to account for multiple hypothesis testing. Here, we used two approaches, multiplicity correction and the FDR, but other approaches may also be employed [307].

Our application highlights other challenges of testing and validating gene-environment interactions. First, we had low to moderate power to detect moderate interaction effects for some of the interactions tested. Not surprisingly, we found modest p-values, of which only one survived the Bonferroni correction, and we had modest FDR estimates for the other highest-ranking interactions. This documents that great caution is needed in claiming gene-environment interactions and, more importantly, the need for extensive replication of the top findings in larger well-powered studies. We stress that the current exercise focuses on hypothesis generation.

Replication studies can also examine whether genetic effects and their respective interactions are similar in populations of different ancestry. Population stratification [181] is the equivalent of confounding for genetic effects. We used analyses adjusting for self-reported race; however, we should acknowledge that the genetic effects identified to date from GWAS are best documented in Caucasian populations and that self-reported ethnicity is subject to bias. Genetic effects for GWAS-discovered markers may be different in different ancestry groups [315-320]. While under-powered, analyses limited to Caucasians showed similar effects for the top hits in our analyses. However, little is known about how gene-environment interactions may behave in populations of different ancestry, and this should be further investigated.

Another issue in studying complex and age-related diseases such as T2D is the classification of cases and controls. For example, a fraction of non-cases at high genetic risk for T2D will not be diagnosed at the time of their sampling. To estimate the effect of this bias, we conducted a sensitivity analysis limiting the cohort to those older than 40. While we observed little difference in our estimates, we acknowledge that effects might be diluted due to this type of bias.

There are but a few documented examples of interaction effects between genome-wide significant loci and environmental or dietary factors on T2D [304]. Through this screen, we have been able to hypothesize about possible new ones. For example, the strongest evidence for interaction in our data existed between rs13266634, a non-synonymous coding variant in the SLC30A8 gene, and three nutrient factors: trans- and cis-β-carotene and γ-tocopherol. The non-synonymous variant rs13266634 is a highly replicated variant, connected with beta cell function and insulin secretion [29, 50, 64]. For example, in a SLC30A8 knockout mouse model, normal glucose-induced insulin release was preserved; however, after a high fat diet, the SLC30A8 knockout mouse became glucose intolerant and diabetic [49]. Our data-driven gene-environment screen has enabled us to hypothesize that impaired insulin secretion imparted by the rs13266634 variant, combined with presence or deficiency of specific nutrients in the diet, leads to greater risk for T2D. We plan to test this hypothesis in depth in both model systems and in other human populations.

Nevertheless, attributing causality of interactions is challenging. Confounding [68] and reverse causality [10] are major issues in studying environmental factors. First, little is known about the causal nature, if any, of the relationship between these environmental factors and T2D [216]. Second, statistical interaction does not equate to biological interaction [295]. Given the modest interaction effect sizes and levels of false discovery, the joint effect of these factors needs to be evaluated in independent, larger populations, including prospective cohorts where the time-dependent associations of environmental factors can be assessed. Nevertheless, genetic information can sometimes be helpful in identifying genuine environmental risk factors through Mendelian randomization [69].

Finally, other infrastructure-related challenges remain for the future of systematic screening for gene-environment interactions [44]. First, we lack a complete list of candidate environmental factors with known marginal effects on disease. In comparison, the analogous list of common genetic variants is well known and is constantly being updated [9]. Second, screening and validating gene-environment interactions is power-intensive and will require new types of biobanks that can accommodate large numbers of environmental and genetic measures taken on the same individuals across multiple studies and cohorts [10]. A straightforward first step includes augmenting current GWAS with environmental information [45, 301] and adopting high-priority exposure measures published by public initiatives such as PhenX (www.phenx.org), whose goal is to build consensus regarding the minimal set of impactful environmental factors to measure in large GWAS-like studies [66, 67]. A systematic approach to measuring and testing the environment, as we have shown in previous chapters, and its interaction with the genetic profile of individuals may help find and explain a substantial component of the disease risk for some common health conditions or even lead to hypotheses regarding disease pathology.

CHAPTER 6: CONCLUSION AND DISCUSSION

In this dissertation we have described and implemented methods to create robust and ranked hypotheses through massive, comprehensive, and systematic association of the envirome to disease and adverse phenotypes, on both molecular and population scales.

In the first method operating on molecular-based toxicogenomics data, we use tools of integrative genomics to merge once disparate datasets, toxicological gene expression responses and disease gene expression states. We developed a generalizable representation of environmentally induced molecular responses called the “Envirome Map”. Assuming that functional states induced by environmental agents are similar to disease, we correlated the Envirome Map to cancer expression states. Specifically, we show how the expression states associated with certain factors, such as Bisphenol A, are correlated with breast cancer, prompting further study. Importantly, the Envirome Map enables hypothesis generation in a scalable and practical way, utilizing data from the public domain in databases such as the Gene Expression Omnibus.

We also have developed and implemented a method to associate the envirome to disease on a population scale, called the “Environment-wide Association Study” (EWAS), analogous to the data-driven genome-wide association study (GWAS). We showed how EWAS provides the benefits of GWAS through transparent and comprehensive reporting. Most importantly, EWAS has enabled the discovery of pollutants and dietary markers in association with common diseases such as Type 2 Diabetes and with risk factors for heart disease, such as serum lipid levels, at low levels of false discovery. This type of discovery is not standard practice in current-day epidemiology. In fact, it has rung a bell in environmental health circles [1, 197, 200], and has introduced a new area of study to genome scientists [196, 198, 199]. Last, and critically, EWAS calls for the study of the highest-ranking novel and robust associations in depth in different study designs and scenarios.

For example, we present a novel way to systematically screen for gene-environment interactions, integrating results from comprehensive studies on the envirome and genome, dubbed a “G-EWAS”. In our prototype G-EWAS on T2D, we screened all possible interactions between robustly identified environmental factors from EWAS and genetic factors from GWAS. Further, we implemented two ways of accounting for multiple hypotheses and, after diverse sensitivity analyses, converged on an interaction between a non-synonymous functional variant in the SLC30A8 gene and nutrient markers. With experimental SLC30A8 knockout models, investigators have hypothesized that the risk for T2D comes as a result of a dietary interaction; through G-EWAS we are able to speculate about the role of specific factors in this hypothesized interaction.

We predict that studies like G-EWAS are just the tip of the iceberg. Our capability of measuring molecular modalities on a population scale is improving exponentially. We are on the brink of a deluge of genomic sequence data, with the capability to measure both genotypes and molecular responses on a massive number of individuals. How will we merge data of different scales with dynamic environmental information to better predict disease? Given that traditionally these data have not been analyzed jointly, we face immense infrastructural and analytic challenges with more complex data modes. It is inevitable that new methods of comprehensive inference in the spirit of EWAS and G-EWAS will need to be developed to take advantage of them.

To even begin to design and conduct such studies, collaborative efforts now need to be focused on defining and standardizing, from nomenclature to means of measure, the “envirome”, a concept we only introduced here. For example, genomic studies have benefitted from the efforts put forth by the National Center for Biotechnology Information, cataloging genetic information in an accessible manner for the scientific and engineering communities to utilize. Further, standardization enabled projects such as the HapMap, in which a federation of institutions was assembled to characterize common genomic variation on the planet. Such efforts need to take place to conduct true envirome-wide studies. Surveys such as the National Health and Nutrition Examination Survey may enable us to attain the map of environmental variation, but they are only a start.

Investigators have now associated genomic variation with hundreds of common diseases [321]. Specifically, studies such as GWAS (Figure 5B) have allowed us to hypothesize about disease etiology and predict genetic risk [203]. As a result, we now have a greater understanding of genetic contribution to disease, but critically, the majority of genetic risk for common disease has yet to be explained. More envirome-based studies are needed to achieve a fuller understanding of disease. Furthermore, the investigation of the environment lags behind that of the genome (Figure 1). To begin closing the gap, we propose conducting many more EWAS, beginning with common, multifactorial diseases prioritized by the World Health Organization (e.g., cardiovascular disease, T2D, hypertension, premature births, lung cancer, asthma). For example, as of this writing, we are conducting EWAS on hypertension on multiple European cohorts, on chronic kidney disease with cohorts in the United States, on asthma with cohorts in the United States, and on mothers with premature births in the United States.

As a result of multiple GWAS on many diseases, we are now able to provide clinically relevant information to patients for disease prognosis and prevention [23]. However, we have yet to combine these data with environmental information. For example, how might specific environmental factors modify our genetic risk for disease? Individuals are now quantifying toxicants in their own tissue [322], and aggregated results from multiple EWAS will enable investigators to estimate personal environmental risk more accurately. Furthermore, results from G-EWAS have implications for personal genomics, whereby we may stratify genetic risk for disease by levels of specific environmental factors. This effort will help us attain truly personalized medicine, in which specific modifiable environmental attributes can become new therapeutic targets based on an individual's genetic profile.

In this dissertation, we have presented and applied new analytical paradigms to comprehensively connect the environment to disease. Just as in the last 10 years we have witnessed the fruits of genome-wide studies, it is now time to usher in envirome-wide studies for a more complete understanding of etiology aligned towards therapeutics and prevention.

REFERENCES

1. Rappaport, S.M. and M.T. Smith, Environment and Disease Risks. Science, 2010. 330(6003): p. 460-461.
2. Schwartz, D. and F. Collins, Medicine. Environmental biology and human disease. Science, 2007. 316(5825): p. 695-6.
3. Klaassen, C.D., ed. Casarett and Doull's Toxicology - The Basic Science of Poisons. 7th ed. 2008, McGraw-Hill.
4. Willett, W.C., Balancing life-style and genomics research for disease prevention. Science, 2002. 296(5568): p. 695-8.
5. Christiani, D.C., Combating Environmental Causes of Cancer. New Engl J Med, 2011. 364(9): p. 791-793.
6. Ramachandrappa, S. and I.S. Farooqi, Genetic approaches to understanding human obesity. J Clin Invest, 2011. 121(6): p. 2080-6.
7. The Wellcome Trust Case Control Consortium, Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls. Nature, 2007. 447(7145): p. 661-678.
8. Pearson, T.A. and T.A. Manolio, How to interpret a genome-wide association study. J Am Med Assoc, 2008. 299(11): p. 1335-44.
9. Hindorff, L., et al. A Catalog of Published Genome-Wide Association Studies. 2009 [cited 2009 7/28/2009]; Available from: http://www.genome.gov/gwastudies.
10. Ioannidis, J., et al., Researching Genetic Versus Nongenetic Determinants of Disease: A Comparison and Proposed Unification. Sci. Transl. Med., 2009. 1(7): p. 8.
11. Baker, D. and M. Nieuwenhuijsen, eds. Environmental Epidemiology. 2008, Oxford University Press: Oxford.
12. Judson, R.S., et al., In Vitro Screening of Environmental Chemicals for Targeted Testing Prioritization -- The ToxCast Project. Environ Health Perspect, 2009. 118(4).
13. Committee on Toxicity Testing and Assessment of Environmental Agents and National Research Council, Toxicity Testing in the 21st Century: A Vision and a Strategy. 2007, Washington, D.C.: National Academies Press.
14. Hubal, E.A., Biologically relevant exposure science for 21st century toxicity testing. Toxicol Sci, 2009. 111(2): p. 226-32.
15. Krewski, D., et al., New directions in toxicity testing. Annu Rev Publ Health, 2011. 32: p. 161-78.
16. Wild, C.P., Complementing the genome with an "exposome": the outstanding challenge of environmental exposure measurement in molecular epidemiology. Cancer Epidemiol Biomarkers Prev, 2005. 14(8): p. 1847-50.

17. World Health Organization. Global Health Observatory Data Repository. 2011 [cited 2011 8/9/2011]; Available from: http://apps.who.int/ghodata/.
18. Mullis, K., Process for amplifying nucleic acid sequences, Cetus Corporation: USA.
19. Illumina, Inc. 2011 [cited 2011 7/19/2011]; Available from: http://www.illumina.com/.
20. National Center for Biotechnology Information. National Center for Biotechnology Information. 2011 [cited 2011 7/18/2011]; Available from: http://www.ncbi.nlm.nih.gov/guide/.
21. Anthony, J.C., The promise of psychiatric enviromics. Br J Psychiatry Suppl, 2001. 40: p. s8-11.
22. Liu, Y.I., P.H. Wise, and A.J. Butte, The "etiome": identification and clustering of human disease etiological factors. BMC Bioinformatics, 2009. 10 Suppl 2: p. S14.
23. Ashley, E.A., et al., Clinical assessment incorporating a personal genome. Lancet, 2010. 375(9725): p. 1525-1535.
24. Kawakami, N., et al., Effects of smoking on the incidence of non-insulin-dependent diabetes mellitus. Replication and extension in a Japanese cohort of male employees. Am J Epidemiol, 1997. 145(2): p. 103-9.
25. International HapMap Consortium, A haplotype map of the human genome. Nature, 2005. 437(7063): p. 1299-320.
26. Goh, K.I., et al., The human disease network. Proc Natl Acad Sci U S A, 2007. 104(21): p. 8685-90.
27. Lopez, A.D., et al., eds. Global Burden of Disease and Risk Factors. 2006, The International Bank for Reconstruction and Development / The World Bank: Washington DC.
28. Lettre, G. and J.D. Rioux, Autoimmune diseases: insights from genome-wide association studies. Hum Mol Genet, 2008. 17(R2): p. R116-21.
29. Rutter, G.A., Think zinc: New roles for zinc in the control of insulin secretion. Islets, 2009. 2(1): p. 49-50.
30. Majithia, A.R. and J.C. Florez, Clinical translation of genetic predictors for type 2 diabetes. Curr Opin Endocrinol, 2009. 16(2): p. 100-6.
31. Lyssenko, V., et al., Mechanisms by which common variants in the TCF7L2 gene increase risk of type 2 diabetes. J Clin Invest, 2007. 117(8): p. 2155-2163.
32. Manolio, T.A., et al., Finding the missing heritability of complex diseases. Nature, 2009. 461(7265): p. 747-753.
33. Goldstein, D.B., Common genetic variation and human traits. N Engl J Med, 2009. 360(17): p. 1696-8.
34. Rothman, K., S. Greenland, and T. Lash, eds. Modern Epidemiology. 3rd ed. 2008, Lippincott Williams & Wilkins: Philadelphia.

35. Patel, C.J., J. Bhattacharya, and A.J. Butte, An Environment-Wide Association Study (EWAS) on type 2 diabetes mellitus. PLoS ONE, 2010. 5(5): p. e10746.
36. Patel, C.J., et al., Non-genetic associations and correlation globes for determinants of lipid levels: an environment-wide association study. In Review, 2011.
37. Centers for Disease Control and Prevention (CDC). National Health and Nutrition Examination Survey. 2009 [cited 2009 9/1/2009]; Available from: http://www.cdc.gov/nchs/nhanes/.
38. Noble, W.S., How does multiple testing correction work? Nat Biotech, 2009. 27(12): p. 1135-1137.
39. Benjamini, Y. and Y. Hochberg, Controlling the false discovery rate: a practical and powerful approach to multiple testing. J R Statist Soc B, 1995.
40. Storey, J.D. and R. Tibshirani, Statistical significance for genomewide studies. Proc Natl Acad Sci U S A, 2003. 100(16): p. 9440-5.
41. Ioannidis, J.P., R. Tarone, and J.K. McLaughlin, The False-positive to False-negative Ratio in Epidemiologic Studies. Epidemiology, 2011. 22(4): p. 450-6.
42. Ioannidis, J.P.A., Why Most Published Research Findings Are False. PLoS Med, 2005. 2(8): p. e124.
43. Miller, E.R., 3rd, et al., Meta-analysis: high-dosage vitamin E supplementation may increase all-cause mortality. Ann Intern Med, 2005. 142(1): p. 37-46.
44. Hunter, D.J., Gene-environment interactions in human diseases. Nat Rev Genet, 2005. 6(4): p. 287-98.
45. Rothman, N., et al., A multi-stage genome-wide association study of bladder cancer identifies multiple susceptibility loci. Nat Genet, 2010. 42(11): p. 978-84.
46. Garcia-Closas, M., et al., NAT2 slow acetylation, GSTM1 null genotype, and risk of bladder cancer: results from the Spanish Bladder Cancer Study and meta-analyses. Lancet, 2005. 366(9486): p. 649-59.
47. Thomas, D., Gene-environment-wide association studies: emerging approaches. Nat Rev Genet, 2010. 11(4): p. 259-272.
48. Ioannidis, J.P., Genetic associations: false or true? Trends Mol Med, 2003. 9(4): p. 135-8.
49. Lemaire, K., et al., Insulin crystallization depends on zinc transporter ZnT8 expression, but is not required for normal glucose homeostasis in mice. Proc Natl Acad Sci USA, 2009. 106(35): p. 14872-7.
50. Nicolson, T.J., et al., Insulin Storage and Glucose Homeostasis in Mice Null for the Granule Zinc Transporter ZnT8 and Studies of the Type 2 Diabetes-Associated Variants. Diabetes, 2009. 58(9): p. 2070-2083.
51. Waters, M.D. and J.M. Fostel, Toxicogenomics and systems toxicology: aims and prospects. Nat Rev Genet, 2004. 5(12): p. 936-48.

52. Lamb, J., et al., The Connectivity Map: using gene-expression signatures to connect small molecules, genes, and disease. Science, 2006. 313(5795): p. 1929-35.
53. Barrett, T., et al., NCBI GEO: mining tens of millions of expression profiles--database and tools update. Nucleic Acids Res, 2007. 35(Database issue): p. D760-5.
54. Davis, A.P., et al., Comparative Toxicogenomics Database: a knowledgebase and discovery tool for chemical-gene-disease networks. Nucleic Acids Res, 2009. 37(Database issue): p. D786-92.
55. Lim, E., et al., T3DB: a comprehensively annotated database of common toxins and their targets. Nucleic Acids Res, 2010. 38(Database issue): p. D781-6.
56. Dix, D.J., et al., The ToxCast program for prioritizing toxicity testing of environmental chemicals. Toxicol Sci, 2007. 95(1): p. 5-12.
57. Dudley, J.T., et al., Disease signatures are robust across tissues and experiments. Mol Syst Biol, 2009. 5.
58. Sirota, M., et al., Discovery and Preclinical Validation of Drug Indications Using Compendia of Public Gene Expression Data. Sci Transl Med, 2011. 3(96): p. 96ra77.
59. MAQC Consortium, The MicroArray Quality Control (MAQC) project shows inter- and intraplatform reproducibility of gene expression measurements. Nat Biotech, 2006. 24(9): p. 1151-1161.
60. MAQC Consortium, The MicroArray Quality Control (MAQC)-II study of common practices for the development and validation of microarray-based predictive models. Nat Biotech, 2010. 28(8): p. 827-838.
61. Patel, C. and A. Butte, Predicting environmental chemical factors associated with disease-related gene expression data. BMC Med Genomics, 2010. 3(1): p. 17.
62. Wang, T.J., et al., Metabolite profiles and the risk of developing diabetes. Nat Med, 2011. 17(4): p. 448-53.
63. Dawber, T.R., G.F. Meadors, and F.E. Moore, Jr., Epidemiological approaches to heart disease: the Framingham Study. Am J Public Health Nations Health, 1951. 41(3): p. 279-81.
64. Voight, B.F., et al., Twelve type 2 diabetes susceptibility loci identified through large-scale association analysis. Nat Genet, 2010. 42(7): p. 579-589.
65. Teslovich, T.M., et al., Biological, clinical and population relevance of 95 loci for blood lipids. Nature, 2010. 466(7307): p. 707-713.
66. Hamilton, C.M., et al., The PhenX Toolkit: Get the Most From Your Measures. Am J Epidemiol, 2011. 174(3): p. 253-260.
67. NHGRI. PhenX. 2011; Available from: http://www.phenx.org.
68. Davey Smith, G., Use of genetic markers and gene-diet interactions for interrogating population-level causal influences of diet on health. Genes Nutr, 2010. 6(1): p. 27-43.

69. Davey Smith, G. and S. Ebrahim, 'Mendelian randomization': can genetic epidemiology contribute to understanding environmental determinants of disease? Int J Epidemiol, 2003. 32(1): p. 1-22.
70. Thorgeirsson, T.E., et al., A variant associated with nicotine dependence, lung cancer and peripheral arterial disease. Nature, 2008. 452(7187): p. 638-42.
71. Liu, J.Z., et al., Meta-analysis and imputation refines the association of 15q25 with smoking quantity. Nat Genet, 2010. 42(5): p. 436-440.
72. Wang, K.S., et al., A meta-analysis of two genome-wide association studies identifies 3 new loci for alcohol dependence. J Psychiatr Res, 2011.
73. Heath, A.C., et al., A Quantitative-Trait Genome-Wide Association Study of Alcoholism Risk in the Community: Findings and Implications. Biol Psychiatry, 2011.
74. Schumann, G., et al., Genome-wide association and genetic functional studies identify autism susceptibility candidate 2 gene (AUTS2) in the regulation of alcohol consumption. Proc Natl Acad Sci U S A, 2011. 108(17): p. 7119-24.
75. Rauch, A., et al., Genetic variation in IL28B is associated with chronic hepatitis C and treatment failure: a genome-wide association study. Gastroenterology, 2010. 138(4): p. 1338-45, 1345 e1-7.
76. Kamatani, Y., et al., A genome-wide association study identifies variants in the HLA-DP locus associated with chronic hepatitis B in Asians. Nat Genet, 2009. 41(5): p. 591-5.
77. Petrovski, S., et al., Common human genetic variants and HIV-1 susceptibility: a genome-wide survey in a homogeneous African population. AIDS, 2011. 25(4): p. 513-8.
78. Sulem, P., et al., Sequence variants at CYP1A1-CYP1A2 and AHR associate with coffee consumption. Hum Mol Genet, 2011. 20(10): p. 2071-7.
79. De Moor, M.H., et al., Genome-wide association study of exercise behavior in Dutch and American adults. Med Sci Sports Exerc, 2009. 41(10): p. 1887-95.
80. Tanaka, T., et al., Genome-wide association study of vitamin B6, vitamin B12, folate, and homocysteine blood concentrations. Am J Hum Genet, 2009. 84(4): p. 477-82.
81. Yang, J.J. and R.M. Plenge, Genomic Technology Applied to Pharmacological Traits. J Am Med Assoc, 2011. 306(6): p. 652-653.
82. Peters, L.L., et al., The mouse as a model for human biology: a resource guide for complex trait analysis. Nat Rev Genet, 2007. 8(1): p. 58-69.
83. Romanoski, C.E., et al., Systems Genetics Analysis of Gene-by-Environment Interactions in Human Cells. Am J Hum Genet, 2010.

84. Idaghdour, Y., et al., Geographical genomics of human leukocyte gene expression variation in southern Morocco. Nat Genet, 2010. 42(1): p. 62-7.
85. Judson, R., et al., The toxicity data landscape for environmental chemicals. Environ Health Perspect, 2009. 117(5): p. 685-95.
86. Weis, B.K., et al., Personalized exposure assessment: promising approaches for human environmental health research. Environ Health Perspect, 2005. 113(7): p. 840-8.
87. Wang, Y., et al., PubChem: a public information system for analyzing bioactivities of small molecules. Nucleic Acids Res, 2009. 37(Web Server issue): p. W623-33.
88. Williams-DeVane, C.R., M.A. Wolf, and A.M. Richard, DSSTox chemical-index files for exposure-related experiments in ArrayExpress and Gene Expression Omnibus: enabling toxico-chemogenomics data linkages. Bioinformatics, 2009. 25(5): p. 692-694.
89. Andrew, A.S., et al., Drinking-water arsenic exposure modulates gene expression in human lymphocytes from a U.S. population. Environ. Health Perspect., 2008. 116(4): p. 524-31.
90. Malard, V., et al., Global gene expression profiling in human lung cells exposed to cobalt. BMC Genomics, 2007. 8: p. 147.
91. Wang, W., et al., NDRG3 is an androgen regulated and prostate enriched gene that promotes in vitro and in vivo prostate cancer cell growth. Int J Cancer, 2009. 124(3): p. 521-30.
92. Gottipolu, R.R., et al., One-month diesel exhaust inhalation produces hypertensive gene expression pattern in healthy rats. Environ. Health Perspect., 2009. 117(1): p. 38-46.
93. Bild, A.H., et al., Oncogenic pathway signatures in human cancers as a guide to targeted therapies. Nature, 2006. 439(7074): p. 353-7.
94. Ashburner, M., et al., Gene ontology: tool for the unification of biology. The Gene Ontology Consortium. Nat Genet, 2000. 25(1): p. 25-9.
95. Gohlke, J.M., et al., Genetic and environmental pathways to complex diseases. BMC Syst Biol, 2009. 3: p. 46.
96. Becker, K.G., et al., The genetic association database. Nat Genet, 2004. 36(5): p. 431-2.
97. Mattingly, C.J., et al., The comparative toxicogenomics database: a cross-species resource for building chemical-gene interaction networks. Toxicol Sci, 2006. 92(2): p. 587-95.
98. Tusher, V.G., R. Tibshirani, and G. Chu, Significance analysis of microarrays applied to the ionizing radiation response. Proc Natl Acad Sci U S A, 2001. 98(9): p. 5116-21.
99. National Center for Biotechnology Information. Homologene. 2010 3/2008]; Available from: http://www.ncbi.nlm.nih.gov/homologene.
100. Zeeberg, B.R., et al., High-Throughput GoMiner, an 'industrial-strength' integrative gene ontology tool for interpretation of multiple-microarray experiments, with application to studies of Common Variable Immune Deficiency (CVID). BMC Bioinformatics, 2005. 6: p. 168.
101. R Core Team, R: A language and environment for statistical computing, 2008, R Foundation for Statistical Computing: Vienna, Austria.
102. Bossé, Y., K. Maghni, and T.J. Hudson, 1alpha,25-dihydroxy-vitamin D3 stimulation of bronchial smooth muscle cells induces autocrine, contractility, and remodeling processes. Physiol Genomics, 2007. 29(2): p. 161-8.
103. Tijet, N., et al., Aryl hydrocarbon receptor regulates distinct dioxin-dependent and dioxin-independent gene batteries. Mol Pharmacol, 2006. 69(1): p. 140-53.
104. Li, Z., et al., Discrimination of vanadium from zinc using gene profiling in human bronchial epithelial cells. Environ. Health Perspect., 2005: p. 1747-54.
105. Selvaraj, V., et al., Gene expression profiling of 17beta-estradiol and genistein effects on mouse thymus. Toxicol Sci, 2005. 87(1): p. 97-112.
106. Lin, C.Y., et al., Whole-genome cartography of estrogen receptor alpha binding sites. PLoS Genet, 2007. 3(6): p. e87.
107. Chandran, U.R., et al., Gene expression profiles of prostate cancer reveal involvement of multiple molecular pathways in the metastatic process. BMC Cancer, 2007. 7: p. 64.
108. Yu, Y.P., et al., Gene expression alterations in prostate cancer predicting tumor aggression and preceding development of malignancy. J Clin Oncol, 2004. 22(14): p. 2790-9.
109. Landi, M.T., et al., Gene expression signature of cigarette smoking and its role in lung adenocarcinoma development and survival. PLoS ONE, 2008. 3(2): p. e1651.
110. Liu, R., et al., The prognostic role of a gene signature from tumorigenic breast-cancer cells. N Engl J Med, 2007. 356(3): p. 217-26.
111. Wang, Y., et al., An overview of the PubChem BioAssay resource. Nucleic Acids Res, 2010. 38(Database issue): p. D255-66.
112. Ho, S.M., et al., Developmental exposure to estradiol and bisphenol A increases susceptibility to prostate carcinogenesis and epigenetically regulates phosphodiesterase type 4 variant 4. Cancer Res, 2006. 66(11): p. 5624-32.
113. Shazer, R.L., et al., Raloxifene, an oestrogen-receptor-beta-targeted therapy, inhibits androgen-independent prostate cancer growth: results from preclinical studies and a pilot phase II clinical trial. BJU Int, 2006. 97(4): p. 691-7.
114. Benbrahim-Tallaa, L., et al., Molecular events associated with arsenic-induced malignant transformation of human prostatic epithelial cells: aberrant genomic DNA methylation and K-ras oncogene activation. Toxicol Appl Pharmacol, 2005. 206(3): p. 288-98.

115. Bertilaccio, M.T., et al., Vasculature-targeted tumor necrosis factor-alpha increases the therapeutic index of doxorubicin against prostate cancer. Prostate, 2008. 68(10): p. 1105-15.
116. Borden, L.S., Jr., et al., Vinorelbine, doxorubicin, and prednisone in androgen-independent prostate cancer. Cancer, 2006. 107(5): p. 1093-100.
117. Amato, R.J. and H. Sarao, A phase I study of paclitaxel/doxorubicin/thalidomide in patients with androgen-independent prostate cancer. Clin Genitourin Cancer, 2006. 4(4): p. 281-6.
118. Kang, J., et al., Subtoxic concentration of doxorubicin enhances TRAIL-induced apoptosis in human prostate cancer cell line LNCaP. Prostate Cancer Prostatic Dis, 2005. 8(3): p. 274-9.
119. Benbrahim-Tallaa, L., et al., Estrogen signaling and disruption of androgen metabolism in acquired androgen-independence during cadmium carcinogenesis in human prostate epithelial cells. Prostate, 2007. 67(2): p. 135-45.
120. Raschke, M., K. Wahala, and B.L. Pool-Zobel, Reduced isoflavone metabolites formed by the human gut microflora suppress growth but do not affect DNA integrity of human prostate cancer cells. Br J Nutr, 2006. 96(3): p. 426-34.
121. Takahashi, Y., et al., Using DNA microarray analyses to elucidate the effects of genistein in androgen-responsive prostate cancer cells: identification of novel targets. Mol Carcinog, 2004. 41(2): p. 108-119.
122. Li, Y., et al., Regulation of gene expression and inhibition of experimental prostate cancer bone metastasis by dietary genistein. Neoplasia, 2004. 6(4): p. 354-63.
123. Koike, H., et al., Insulin-like growth factor binding protein-6 inhibits prostate cancer cell proliferation: implication for anticancer effect of diethylstilbestrol in hormone refractory prostate cancer. Br J Cancer, 2005. 92(8): p. 1538-44.
124. Oh, W.K., The evolving role of estrogen therapy in prostate cancer. Clin Prostate Cancer, 2002. 1(2): p. 81-9.
125. Tokar, E.J., et al., Cholecalciferol (vitamin D3) and the retinoid N-(4-hydroxyphenyl)retinamide (4-HPR) are synergistic for chemoprevention of prostate cancer. J Exp Ther Oncol, 2006. 5(4): p. 323-33.
126. Costello, L.C. and R.B. Franklin, The clinical relevance of the metabolism of prostate cancer; zinc and tumor suppression: connecting the dots. Mol Cancer, 2006. 5: p. 17.
127. Uzzo, R.G., et al., Diverse effects of zinc on NF-kappaB and AP-1 transcription factors: implications for prostate cancer progression. Carcinogenesis, 2006. 27(10): p. 1980-90.
128. Michael, I.P., et al., Human tissue kallikrein 5 is a member of a proteolytic cascade pathway involved in seminal clot liquefaction and potentially in prostate cancer progression. J Biol Chem, 2006. 281(18): p. 12743-50.
129. Uzzo, R.G., et al., Zinc inhibits nuclear factor-kappa B activation and sensitizes prostate cancer cells to cytotoxic agents. Clin Cancer Res, 2002. 8(11): p. 3579-83.
130. Filyak, Y., O. Filyak, and R. Stoika, Transforming growth factor beta-1 enhances cytotoxic effect of doxorubicin in human lung adenocarcinoma cells of A549 line. Cell Biol Int, 2007. 31(8): p. 851-5.
131. Shen, J., et al., Fetal onset of aberrant gene expression relevant to pulmonary carcinogenesis in lung adenocarcinoma development induced by in utero arsenic exposure. Toxicol Sci, 2007. 95(2): p. 313-20.
132. Waalkes, M.P., et al., Enhanced urinary bladder and liver carcinogenesis in male CD1 mice exposed to transplacental inorganic arsenic and postnatal diethylstilbestrol or tamoxifen. Toxicol Appl Pharmacol, 2006. 215(3): p. 295-305.
133. Waalkes, M.P., et al., Animal models for arsenic carcinogenesis: inorganic arsenic is a transplacental carcinogen in mice. Toxicol Appl Pharmacol, 2004. 198(3): p. 377-84.
134. Devereux, T.R., et al., Map kinase activation correlates with K-ras mutation and loss of heterozygosity on chromosome 6 in alveolar bronchiolar carcinomas from B6C3F1 mice exposed to vanadium pentoxide for 2 years. Carcinogenesis, 2002. 23(10): p. 1737-43.
135. Zanesi, N., et al., Lung cancer susceptibility in Fhit-deficient mice is increased by Vhl haploinsufficiency. Cancer Res, 2005. 65(15): p. 6576-82.
136. Diament, M.J., et al., Inhibition of tumor progression and paraneoplastic syndrome development in a murine lung adenocarcinoma by medroxyprogesterone acetate and indomethacin. Cancer Invest, 2006. 24(2): p. 126-31.
137. Moody, T.W., et al., Indomethacin reduces lung adenoma number in A/J mice. Anticancer Res, 2001. 21(3B): p. 1749-55.
138. Levin, G., et al., Indomethacin inhibits the accumulation of tumor cells in mouse lungs and subsequent growth of lung metastases. Chemotherapy, 2000. 46(6): p. 429-37.
139. Meira, L.B., et al., Cancer predisposition in mutant mice defective in multiple genetic pathways: uncovering important genetic interactions. Mutat Res, 2001. 477(1-2): p. 51-8.
140. Fan, J.G., Q.E. Wang, and S.J. Liu, Chrysotile-induced cell transformation and transcriptional changes of c-myc oncogene in human embryo lung cells. Biomed Environ Sci, 2000. 13(3): p. 163-9.
141. Carvajal, A., et al., Progesterone pre-treatment potentiates EGF pathway signaling in the breast cancer cell line ZR-75. Breast Cancer Res Treat, 2005. 94(2): p. 171-83.

142. Kato, S., et al., Progesterone increases tissue factor gene expression, procoagulant activity, and invasion in the breast cancer cell line ZR-75-1. J Clin Endocrinol Metab, 2005. 90(2): p. 1181-8.
143. Verheus, M., et al., Plasma phytoestrogens and subsequent breast cancer risk. J Clin Oncol, 2007. 25(6): p. 648-55.
144. Nobert, G.S., M.M. Kraak, and S. Crawford, Estrogen dependent growth inhibitory effects of tamoxifen but not genistein in solid tumors derived from estrogen receptor positive (ER+) primary breast carcinoma MCF7: single agent and novel combined treatment approaches. Bull Cancer, 2006. 93(7): p. E59-66.
145. Seo, H.S., et al., Stimulatory effect of genistein and apigenin on the growth of breast cancer cells correlates with their ability to activate ER alpha. Breast Cancer Res Treat, 2006. 99(2): p. 121-34.
146. Lakshmanaswamy, R., R.C. Guzman, and S. Nandi, Hormonal prevention of breast cancer: significance of promotional environment. Adv Exp Med Biol, 2008. 617: p. 469-75.
147. Bergman Jungestrom, M., L.U. Thompson, and C. Dabrosin, Flaxseed and its lignans inhibit estradiol-induced growth, angiogenesis, and secretion of vascular endothelial growth factor in human breast cancer xenografts in vivo. Clin Cancer Res, 2007. 13(3): p. 1061-7.
148. Vogel, V.G., Recent results from clinical trials using SERMs to reduce the risk of breast cancer. Ann N Y Acad Sci, 2006. 1089: p. 127-42.
149. Eliassen, A.H., et al., Endogenous steroid hormone concentrations and risk of breast cancer among premenopausal women. J Natl Cancer Inst, 2006. 98(19): p. 1406-15.
150. Russo, J., et al., Estrogen and its metabolites are carcinogenic agents in human breast epithelial cells. J Steroid Biochem Mol Biol, 2003. 87(1): p. 1-25.
151. Ackerstaff, E., et al., Anti-inflammatory agent indomethacin reduces invasion and alters metabolism in a human breast cancer cell line. Neoplasia, 2007. 9(3): p. 222-35.
152. Green, M., et al., Diallyl sulfide induces the expression of estrogen metabolizing genes in the presence and/or absence of diethylstilbestrol in the breast of female ACI rats. Toxicol Lett, 2007. 168(1): p. 7-12.
153. Walter, G., R. Liebl, and E. von Angerer, Synthesis and biological evaluation of stilbene-based pure estrogen antagonists. Bioorg Med Chem Lett, 2004. 14(18): p. 4659-63.
154. Vegran, F., et al., Overexpression of caspase-3s splice variant in locally advanced breast carcinoma is associated with poor response to neoadjuvant chemotherapy. Clin Cancer Res, 2006. 12(19): p. 5794-800.
155. Untch, M., et al., Cardiac safety of trastuzumab in combination with epirubicin and cyclophosphamide in women with metastatic breast cancer: results of a phase I trial. Eur J Cancer, 2004. 40(7): p. 988-97.

156. Machiels, J.P., et al., Cyclophosphamide, doxorubicin, and paclitaxel enhance the antitumor immune response of granulocyte/macrophage-colony stimulating factor-secreting whole-cell vaccines in HER-2/neu tolerized mice. Cancer Res, 2001. 61(9): p. 3689-97.
157. Murray, T.J., et al., Induction of mammary gland ductal hyperplasias and carcinoma in situ following fetal bisphenol A exposure. Reprod Toxicol, 2007. 23(3): p. 383-90.
158. Uehara, T., et al., A toxicogenomics approach for early assessment of potential non-genotoxic hepatocarcinogenicity of chemicals in rats. Toxicology, 2008. 250(1): p. 15-26.
159. Yager, J.D. and N.E. Davidson, Estrogen carcinogenesis in breast cancer. N Engl J Med, 2006. 354(3): p. 270-82.
160. Dairkee, S.H., et al., Bisphenol A induces a profile of tumor aggressiveness in high-risk cells from breast cancer patients. Cancer Res, 2008. 68(7): p. 2076-80.
161. Buteau-Lozano, H., et al., Xenoestrogens modulate vascular endothelial growth factor secretion in breast cancer cells through an estrogen receptor-dependent mechanism. J Endocrinol, 2008. 196(2): p. 399-412.
162. Subramanian, A., et al., Gene set enrichment analysis: a knowledge-based approach for interpreting genome-wide expression profiles. Proc Natl Acad Sci USA, 2005. 102(43): p. 15545-50.
163. Salonen, J.T., et al., Type 2 diabetes whole-genome association study in four populations: the DiaGen consortium. Am J Hum Genet, 2007. 81(2): p. 338-45.
164. Saxena, R., et al., Genome-wide association analysis identifies loci for type 2 diabetes and triglyceride levels. Science, 2007. 316(5829): p. 1331-6.
165. Sladek, R., et al., A genome-wide association study identifies novel risk loci for type 2 diabetes. Nature, 2007. 445(7130): p. 881-5.
166. McClellan, J. and M.-C. King, Genetic Heterogeneity in Human Disease. Cell, 2010. 141(2): p. 210-217.
167. Hardy, J. and A. Singleton, Genomewide Association Studies and Human Disease. New Engl J Med, 2009. 360(17): p. 1759-1768.
168. Manolio, T.A., L.D. Brooks, and F.S. Collins, A HapMap harvest of insights into the genetics of common disease. J Clin Invest, 2008. 118(5): p. 1590-605.
169. Wetterstrand, K. DNA Sequencing Costs: Data from the NHGRI Large-Scale Genome Sequencing Program. 2011 2011/08/12]; Available from: http://www.genome.gov/sequencingcosts.
170. Frayling, T.M., et al., A common variant in the FTO gene is associated with body mass index and predisposes to childhood and adult obesity. Science, 2007. 316(5826): p. 889-94.

171. McCarthy, M.I., et al., Genome-wide association studies for complex traits: consensus, uncertainty and challenges. Nat Rev Genet, 2008. 9(5): p. 356-369.
172. NCI-NHGRI Working Group on Replication in Association Studies, Replicating genotype-phenotype associations. Nature, 2007. 447(7145): p. 655-660.
173. Ioannidis, J.P., et al., Replication validity of genetic association studies. Nat Genet, 2001. 29(3): p. 306-9.
174. Christakis, N.A. and J.H. Fowler, The spread of obesity in a large social network over 32 years. N Engl J Med, 2007. 357(4): p. 370-9.
175. Pearson, J.F., et al., Association Between Fine Particulate Matter and Diabetes Prevalence in the U.S. Diabetes Care, 2010. 33(10): p. 2196-2201.
176. Butte, A.J. and I.S. Kohane, Mutual information relevance networks: functional genomic clustering using pairwise entropy measurements. Pac Symp Biocomput, 2000: p. 418-29.
177. Butte, A.J., et al., Discovering functional relationships between RNA expression and chemotherapeutic susceptibility using relevance networks. Proc Natl Acad Sci U S A, 2000. 97(22): p. 12182-6.
178. Austin, P.C., et al., Testing multiple statistical hypotheses resulted in spurious associations: a study of astrological signs and health. J Clin Epidemiol, 2006. 59(9): p. 964-9.
179. Young, S.S., Acknowledge and fix the multiple testing problem. Int J Epidemiol, 2010. 39(3): p. 934; author reply 934-5.
180. Young, S.S. and M. Yu, Association of bisphenol A with diabetes and other abnormalities. J Am Med Assoc, 2009. 301(7): p. 720-1; author reply 721-2.
181. Smith, G.D., et al., Clustered environments and randomized genes: a fundamental distinction between conventional and genetic epidemiology. PLoS Med, 2007. 4(12): p. e352.
182. Greenland, S., Randomization, Statistics, and Causal Inference. Epidemiology, 1990. 1(6): p. 421-429.
183. Greenland, S. and H. Morgenstern, Confounding in Health Research. Annu Rev Public Health, 2001. 22(1): p. 189-212.
184. Peto, R., et al., Can dietary beta-carotene materially reduce human cancer rates? Nature, 1981. 290(5803): p. 201-208.
185. Omenn, G.S., et al., Effects of a combination of beta carotene and vitamin A on lung cancer and cardiovascular disease. N Engl J Med, 1996. 334(18): p. 1150-5.
186. Hooper, L., A.R. Ness, and G.D. Smith, Antioxidant strategy for cardiovascular diseases. Lancet, 2001. 357(9269): p. 1705-6.
187. Bartell, S.M., W.C. Griffith, and E.M. Faustman, Temporal error in biomarker-based mean exposure estimates for individuals. J Expo Anal Environ Epidemiol, 2004. 14(2): p. 173-179.

188. Manly, B.F., Randomization, Bootstrap and Monte Carlo Methods in Biology. 3rd ed. 2007, Boca Raton: Chapman and Hall/CRC.
189. Efron, B., Large-Scale Inference. 2010, Cambridge: Cambridge University Press.
190. Westfall, P.H. and S.S. Young, Resampling-based Multiple Testing. 1993, New York: Wiley.
191. Witten, D.M. and R. Tibshirani, Survival analysis with high-dimensional covariates. Stat Methods Med Res, 2010. 19(1): p. 29-51.
192. Tibshirani, R., Regression Shrinkage and Selection via the Lasso. Journal of the Royal Statistical Society. Series B (Methodological), 1996. 58(1): p. 267-288.
193. Zou, H. and T. Hastie, Regularization and Variable Selection via the Elastic Net. Journal of the Royal Statistical Society. Series B (Statistical Methodology), 2005. 67(2): p. 301-320.
194. Hastie, T., R. Tibshirani, and J.H. Friedman, The Elements of Statistical Learning: Data Mining, Inference, and Prediction. 2nd ed. 2009: Springer.
195. Vittinghoff, E., et al., Regression Methods in Biostatistics: Linear, Logistic, Survival, and Repeated Measures Models. 2005, New York: Springer.
196. Todd, J.A., D'oh! Genes and Environment Cause Crohn's Disease. Cell, 2010. 141(7): p. 1114-1116.
197. Fallin, M.D. and W.H.L. Kao, Is 'X'-WAS the Future for All of Epidemiology? Epidemiology, 2011. 22(4): p. 457-459. doi: 10.1097/EDE.0b013e31821d3a9f.
198. Mak, H.C., Trends in computational biology - 2010. Nat Biotech, 2011. 29(1): p. 45.
199. Heard, E., et al., Ten years of genetics and genomics: what have we achieved and where are we heading? Nat Rev Genet, 2010. 11(10): p. 723-733.
200. Borrell, B., Epidemiology: Every bite you take. Nature, 2011. 470(7334): p. 320-2.
201. Mathers, C.D. and D. Loncar, Projections of global mortality and burden of disease from 2002 to 2030. PLoS Med, 2006. 3(11): p. e442.
202. Zeggini, E., et al., Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes. Nat Genet, 2008. 40(5): p. 638-45.
203. Hindorff, L.A., et al., Potential etiologic and functional implications of genome-wide association loci for human diseases and traits. Proc Natl Acad Sci USA, 2009. 106(23): p. 9362-9367.
204. ADA. Diabetes Information -- All About Diabetes. 2009 6/1/2009]; Available from: http://www.diabetes.org/about-diabetes.jsp.
205. Lumley, T., survey: analysis of complex survey samples, 2009.
206. R Development Core Team, R: A language for statistical computing, 2009, R Foundation for Statistical Computing: Vienna, Austria.

169 207. CDC and National Center for Health Statistics (NCHS). National Health and Nutrition Examination Survey Analytic Guidelines. 2003 [cited 2010 2/19/2010]; Available from: http://www.cdc.gov/nchs/data/nhanes/nhanes_03_04/nhanes_analytic_ guidelines_dec_2005.pdf. 208. Cowie, C.C., et al., Prevalence of diabetes and impaired fasting glucose in adults in the U.S. population: National Health And Nutrition Examination Survey 1999-2002. Diabetes Care, 2006. 29(6): p. 1263-8. 209. Abahusain, M.A., et al., Retinol, alpha-tocopherol and carotenoids in diabetes. Eur J Clin Nutr, 1999. 53(8): p. 630-5. 210. Polidori, M.C., et al., Plasma levels of lipophilic antioxidants in very old patients with type 2 diabetes. Diabetes Metab Res Rev, 2000. 16(1): p. 15-9. 211. Arnlov, J., et al., Serum and dietary beta-carotene and alpha- tocopherol and incidence of type 2 diabetes mellitus in a community- based study of Swedish men: report from the Uppsala Longitudinal Study of Adult Men (ULSAM) study. Diabetologia, 2009. 52(1): p. 97- 105. 212. Ford, E.S., et al., Diabetes mellitus and serum carotenoids: findings from the Third National Health and Nutrition Examination Survey. Am J Epidemiol, 1999. 149(2): p. 168-76. 213. Ylonen, K., et al., Dietary intakes and plasma concentrations of carotenoids and tocopherols in relation to glucose metabolism in subjects at high risk of type 2 diabetes: the Botnia Dietary Study. Am J Clin Nutr, 2003. 77(6): p. 1434-41. 214. Wang, L., et al., Plasma lycopene, other carotenoids, and the risk of type 2 diabetes in women. Am J Epidemiol, 2006. 164(6): p. 576-85. 215. Montonen, J., et al., Dietary antioxidant intake and risk of type 2 diabetes. Diabetes Care, 2004. 27(2): p. 362-6. 216. Song, Y., et al., Effects of vitamins C and E and beta-carotene on the risk of type 2 diabetes in women at high risk of cardiovascular disease: a randomized controlled trial. Am J Clin Nutr, 2009. 90(2): p. 429-37. 217. 
Kataja-Tuomola, M., et al., Effect of alpha-tocopherol and beta- carotene supplementation on the incidence of type 2 diabetes. Diabetologia, 2008. 51(1): p. 47-53. 218. Codru, N., et al., Diabetes in relation to serum levels of polychlorinated biphenyls and chlorinated pesticides in adult Native Americans. Environ Health Perspect, 2007. 115(10): p. 1442-7. 219. Uemura, H., et al., Associations of environmental exposure to dioxins with prevalent diabetes among general inhabitants in Japan. Environ Res, 2008. 108(1): p. 63-8. 220. Rignell-Hydbom, A., L. Rylander, and L. Hagmar, Exposure to persistent organochlorine pollutants and type 2 diabetes mellitus. Hum Exp Toxicol, 2007. 26(5): p. 447-52.

170 221. Wang, S.L., et al., Increased risk of diabetes and polychlorinated biphenyls and dioxins: a 24-year follow-up study of the Yucheng cohort. Diabetes Care, 2008. 31(8): p. 1574-9. 222. Jiang, Q., et al., gamma-tocopherol, the major form of vitamin E in the US diet, deserves more attention. Am J Clin Nutr, 2001. 74(6): p. 714- 22. 223. Burton, G.W., et al., Human plasma and tissue alpha-tocopherol concentrations in response to supplementation with deuterated natural and synthetic vitamin E. Am J Clin Nutr, 1998. 67(4): p. 669-84. 224. Campbell, S., et al., Development of gamma (gamma)-tocopherol as a colorectal cancer chemopreventive agent. Crit Rev Oncol Hematol, 2003. 47(3): p. 249-59. 225. Agency for Toxic Substances and Disease Registry. Heptachlor and Heptachlor Epoxide. 2007 [cited 2009 8/1/2009]; Available from: http://www.atsdr.cdc.gov/tfacts12.html. 226. Office of Water Regulations and Standards, Ambient Water Quality Criteria for Heptachlor, ed. U.S.E.P. Agency. Vol. EPA 440 5-80-052. 1980, Washington, DC: United States Environmental Production Agency. 227. Montgomery, M.P., et al., Incident diabetes and pesticide exposure among licensed pesticide applicators: Agricultural Health Study, 1993- 2003. Am J Epidemiol, 2008. 167(10): p. 1235-46. 228. Zeggini, E., et al., Replication of genome-wide association signals in UK samples reveals risk loci for type 2 diabetes. Science, 2007. 316(5829): p. 1336-41. 229. Heller, D.A., et al., Genetic and environmental influences on serum lipid levels in twins. N Engl J Med, 1993. 328(16): p. 1150-6. 230. Costanza, M.C., et al., Relative Contributions of Genes, Environment, and Interactions to Blood Lipid Concentrations in a General Adult Population. American Journal of Epidemiology, 2005. 161(8): p. 714- 724. 231. Kris-Etherton, P.M., et al., The effect of diet on plasma lipids, lipoproteins, and coronary heart disease. J Am Diet Assoc, 1988. 88(11): p. 1373-400. 232. 
Schaefer, E.J., Lipoproteins, nutrition, and heart disease. Am J Clin Nutr, 2002. 75(2): p. 191-212. 233. Varady, K.A. and P.J. Jones, Combination diet and exercise interventions for the treatment of dyslipidemia: an effective preliminary strategy to lower cholesterol levels? J Nutr, 2005. 135(8): p. 1829-35. 234. Craig, W.Y., G.E. Palomaki, and J.E. Haddow, Cigarette smoking and serum lipid and lipoprotein concentrations: an analysis of published data. BMJ, 1989. 298(6676): p. 784-8. 235. Kraus, W.E., et al., Effects of the amount and intensity of exercise on plasma lipoproteins. N Engl J Med, 2002. 347(19): p. 1483-92.

171 236. Brook, R.D., et al., Particulate Matter Air Pollution and Cardiovascular Disease: An Update to the Scientific Statement From the American Heart Association. Circulation, 2010. 121(21): p. 2331- 2378. 237. Ezzati, T.M., et al., Sample design: Third National Health and Nutrition Examination Survey. Vital Health Stat 2, 1992(113): p. 1-35. 238. Storey, J.D., A Direct Approach to False Discovery Rates. J R Statist Soc B, 2002. 64: p. 479-498. 239. American Heart Association. Drug Therapy for Cholesterol. 2010 [cited 2010 10/5]; Available from: http://www.heart.org/HEARTORG/Conditions/Cholesterol/Prevention TreatmentofHighCholesterol/Drug-Therapy-for- Cholesterol_UCM_305632_Article.jsp. 240. Ainsworth, B.E., et al., Compendium of physical activities: an update of activity codes and MET intensities. Med Sci Sports Exerc, 2000. 32(9 Suppl): p. S498-504. 241. Nelson, M.E., et al., Physical activity and public health in older adults: recommendation from the American College of Sports Medicine and the American Heart Association. Med Sci Sports Exerc, 2007. 39(8): p. 1435-45. 242. Cohen, J., Statistical power analysis for the behavioral sciences. 2 ed1988, Hillsdale, NJ: Erlbaum. 243. Fryar, C.D., et al., Hypertension, high serum total cholesterol, and diabetes: racial and ethnic prevalence differences in U.S. adults, 1999- 2006. NCHS Data Brief, 2010(36): p. 1-8. 244. Ford, E.S., et al., Hypertriglyceridemia and Its Pharmacologic Treatment Among US Adults. Arch Intern Med, 2009. 169(6): p. 572- 578. 245. Harrison, E.H., Mechanisms of digestion and absorption of dietary vitamin A. Annu Rev Nutr, 2005. 25: p. 87-103. 246. Willett, W.C., Nutritional Epidemiology1998, New York: Oxford University Press. 247. Yusuf, S., et al., Vitamin E supplementation and cardiovascular events in high-risk patients. The Heart Outcomes Prevention Evaluation Study Investigators. N Engl J Med, 2000. 342(3): p. 154-60. 248. 
Omenn, G.S., et al., Long-term vitamin A does not produce clinically significant hypertriglyceridemia: results from CARET, the beta- carotene and retinol efficacy trial. Cancer Epidemiol Biomarkers Prev, 1994. 3(8): p. 711-3. 249. Redlich, C.A., et al., Effect of long-term beta-carotene and vitamin A on serum cholesterol and triglyceride levels among participants in the Carotene and Retinol Efficacy Trial (CARET). Atherosclerosis, 1999. 145(2): p. 425-32.

172 250. Vivekananthan, D.P., et al., Use of antioxidant vitamins for the prevention of cardiovascular disease: meta-analysis of randomised trials. Lancet, 2003. 361(9374): p. 2017-2023. 251. Mente, A., et al., A Systematic Review of the Evidence Supporting a Causal Link Between Dietary Factors and Coronary Heart Disease. Arch Intern Med, 2009. 169(7): p. 659-669. 252. Willcox, B.J., J.D. Curb, and B.L. Rodriguez, Antioxidants in cardiovascular health and disease: key lessons from epidemiologic studies. Am J Cardiol, 2008. 101(10A): p. 75D-86D. 253. Bender, D., Nutritional Biochemistry of the VItamins2003, Cambridge: University of Cambridge Press. 254. Ogihara, T., et al., Distribution of tocopherol among human plasma lipoproteins. Clin Chim Acta, 1988. 174(3): p. 299-305. 255. Winbauer, A.N., S.S. Pingree, and K.L. Nuttall, Evaluating serum alpha-tocopherol (vitamin E) in terms of a lipid ratio. Ann Clin Lab Sci, 1999. 29(3): p. 185-91. 256. Semmler, A., et al., Plasma folate levels are associated with the lipoprotein profile: a retrospective database analysis. Nutrition Journal, 2010. 9(1): p. 31. 257. Jorde, R., et al., High serum 25-hydroxyvitamin D concentrations are associated with a favorable serum lipid profile. Eur J Clin Nutr, 2010. 258. Smith, K.M., et al., Relationship between fish intake, n-3 fatty acids, mercury and risk markers of CHD (National Health and Nutrition Examination Survey 1999-2002). Public Health Nutr, 2009. 12(8): p. 1261-9. 259. Hu, F.B. and W.C. Willett, Optimal Diets for Prevention of Coronary Heart Disease. J Am Med Assoc, 2002. 288(20): p. 2569-2578. 260. Joshipura, K.J., et al., The Effect of Fruit and Vegetable Intake on Risk for Coronary Heart Disease. Ann Intern Med, 2001. 134(12): p. 1106- 1114. 261. Bassett, C.M., D. Rodriguez-Leyva, and G.N. Pierce, Experimental and clinical research findings on the cardiovascular benefits of consuming flaxseed. Appl Physiol Nutr Metab, 2009. 34(5): p. 965-74. 262. 
Pan, A., et al., Meta-analysis of the effects of flaxseed interventions on blood lipids. Am J Clin Nutr, 2009. 90(2): p. 288-97. 263. Park, D., T. Huang, and W.H. Frishman, Phytoestrogens as cardioprotective agents. Cardiol Rev, 2005. 13(1): p. 13-7. 264. Xu, X., et al., Studying associations between urinary metabolites of polycyclic aromatic hydrocarbons (PAHs) and cardiovascular diseases in the United States. Sci Total Environ, 2010. 408(21): p. 4943-4948. 265. Pope, C.A., III, et al., Cardiovascular Mortality and Long-Term Exposure to Particulate Air Pollution: Epidemiological Evidence of General Pathophysiological Pathways of Disease. Circulation, 2004. 109(1): p. 71-77.

173 266. Miller, K.A., et al., Long-term exposure to air pollution and incidence of cardiovascular events in women. N Engl J Med, 2007. 356(5): p. 447-58. 267. Wilson, P.W., et al., Factors associated with lipoprotein cholesterol levels. The Framingham study. Arteriosclerosis, 1983. 3(3): p. 273-81. 268. Njolstad, I., E. Arnesen, and P.G. Lund-Larsen, Smoking, serum lipids, blood pressure, and sex differences in myocardial infarction. A 12-year follow-up of the Finnmark Study. Circulation, 1996. 93(3): p. 450-6. 269. Moffatt, R.J., et al., Acute exposure to environmental tobacco smoke reduces HDL-C and HDL2-C. Prev Med, 2004. 38(5): p. 637-41. 270. Uemura, H., et al., Prevalence of metabolic syndrome associated with body burden levels of dioxin and related compounds among Japan's general population. Environ Health Perspect, 2009. 117(4): p. 568-73. 271. Dirinck, E., et al., Obesity and Persistent Organic Pollutants: Possible Obesogenic Effect of Organochlorine Pesticides and Polychlorinated Biphenyls. Obesity (Silver Spring), 2010. 272. Goncharov, A., et al., High serum PCBs are associated with elevation of serum lipids and cardiovascular disease in a Native American population. Environ Res, 2008. 106(2): p. 226-39. 273. Sergeev, A.V. and D.O. Carpenter, Hospitalization rates for coronary heart disease in relation to residence near areas contaminated with persistent organic pollutants and other pollutants. Environ Health Perspect, 2005. 113(6): p. 756-61. 274. Gustavsson, P. and C. Hogstedt, A cohort study of Swedish capacitor manufacturing workers exposed to polychlorinated biphenyls (PCBs). Am J Ind Med, 1997. 32(3): p. 234-9. 275. Morgan, T.M., et al., Nonvalidation of reported genetic risk factors for acute coronary syndrome in a large-scale replication study. J Am Med Assoc, 2007. 297(14): p. 1551-61. 276. Boffetta, P., et al., False-positive results in cancer epidemiology: a plea for epistemological modesty. J Natl Cancer Inst, 2008. 100(14): p. 988- 95. 277. 
Lang, I.A., et al., Association of urinary bisphenol A concentration with medical disorders and laboratory abnormalities in adults. J Am Med Assoc, 2008. 300(11): p. 1303-10. 278. Navas-Acien, A., et al., Arsenic exposure and prevalence of type 2 diabetes in US adults. J Am Med Assoc, 2008. 300(7): p. 814-22. 279. Everett, C.J., et al., Association of a polychlorinated dibenzo-p-dioxin, a polychlorinated biphenyl, and DDT with diabetes in the 1999-2002 National Health and Nutrition Examination Survey. Environ Res, 2007. 103(3): p. 413-8. 280. Lee, D.H., et al., Association between serum concentrations of persistent organic pollutants and insulin resistance among nondiabetic adults: results from the National Health and Nutrition Examination Survey 1999-2002. Diabetes Care, 2007. 30(3): p. 622-8.

174 281. Lee, D.H., et al., Relationship between serum concentrations of persistent organic pollutants and the prevalence of metabolic syndrome among non-diabetic adults: results from the National Health and Nutrition Examination Survey 1999-2002. Diabetologia, 2007. 50(9): p. 1841-51. 282. Kitao, Y., et al., A contribution to genome-wide association studies: search for susceptibility loci for schizophrenia using DNA microsatellite markers on 19, 20, 21 and 22. Psychiatr Genet, 2000. 10(3): p. 139-43. 283. Ohnishi, Y., et al., A high-throughput SNP typing system for genome- wide association studies. J Hum Genet, 2001. 46(8): p. 471-7. 284. Duncan David E, Experimental man : what one man's body reveals about his future, your health, and our toxic world2009, Hoboken, NJ: Wiley. 285. Ober, C. and D. Vercelli, Gene-environment interactions in human disease: nuisance or opportunity? Trends Genet, 2011. 27(3): p. 107- 15. 286. Eichler, E.E., et al., Missing heritability and strategies for finding the underlying causes of complex disease. Nat Rev Genet, 2010. 11(6): p. 446-50. 287. National Institute of Child Health and Human Development. Phenylketonuria. 2010 3/24/2010 [cited 2010 8/18]; Available from: http://www.nichd.nih.gov/health/topics/phenylketonuria.cfm. 288. Crowley, J.J., P.F. Sullivan, and H.L. McLeod, Pharmacogenomic genome-wide association studies: lessons learned thus far. Pharmacogenomics, 2009. 10(2): p. 161-3. 289. Khoury, M.J., M.J. Adams, Jr., and W.D. Flanders, An epidemiologic approach to ecogenetics. Am J Hum Genet, 1988. 42(1): p. 89-95. 290. Garrod, A., Alkaptonuria. Lancet, 1902: p. 653-656. 291. Garrod, A., The Inborn Factors in Disease: An Essay1931, Oxford: Clarendon Press. 292. Motulsky, A.G., Drug reactions, enzymes, and biochemical genetics. J Am Med Assoc, 1957. 165(7): p. 835-837. 293. Khoury, M.J., T.H. Beaty, and B. Cohen, Fundamentals of Genetic Epidemiology. 1 ed. Vol. 1. 1993, New York: Oxford University Press. 294. Siemiatycki, J. 
and D.C. Thomas, Biological models and statistical interactions: an example from multistage carcinogenesis. Int J Epidemiol, 1981. 10(4): p. 383-7. 295. Wang, X., R.C. Elston, and X. Zhu, Statistical interaction in human genetics: how should we model it if we are looking for biological interaction? Nat Rev Genet, 2010. 12(1): p. 74. 296. Kellermann, G., C.R. Shaw, and M. Luyten-Kellerman, Aryl hydrocarbon hydroxylase inducibility and bronchogenic carcinoma. N Engl J Med, 1973. 289(18): p. 934-7.

175 297. Stern, M.C., et al., Polymorphisms in DNA repair genes, smoking, and bladder cancer risk: findings from the international consortium of bladder cancer. Cancer Res, 2009. 69(17): p. 6857-64. 298. Vineis, P., et al., Current smoking, occupation, N-acetyltransferase-2 and bladder cancer: a pooled analysis of genotype-based studies. Cancer Epidemiol Biomarkers Prev, 2001. 10(12): p. 1249-52. 299. Grarup, N. and G. Andersen, Gene-environment interactions in the pathogenesis of type 2 diabetes and metabolism. Curr Opin Clin Nutr Metab Care, 2007. 10(4): p. 420-6. 300. Romao, I. and J. Roth, Genetic and environmental interactions in obesity and type 2 diabetes. J Am Diet Assoc, 2008. 108(4 Suppl 1): p. S24-8. 301. Khoury, M.J. and S. Wacholder, Invited commentary: from genome- wide association studies to gene-environment-wide interaction studies-- challenges and opportunities. Am J Epidemiol, 2009. 169(2): p. 227- 30; discussion 234-5. 302. Hetherington, M.M. and J.E. Cecil, Gene-environment interactions in obesity. Forum Nutr, 2010. 63: p. 195-203. 303. Memisoglu, A., et al., Interaction between a peroxisome proliferator- activated receptor gamma gene polymorphism and dietary fat intake in relation to body mass. Hum Mol Genet, 2003. 12(22): p. 2923-9. 304. Cornelis, M.C., et al., TCF7L2, dietary carbohydrate, and risk of type 2 diabetes in US women. Am J Clin Nutr, 2009. 89(4): p. 1256-1262. 305. Ioannidis, J.P., Why most discovered true associations are inflated. Epidemiology, 2008. 19(5): p. 640-8. 306. Omenn, G.S., Overview of the symposium on public health significance of genomics and eco-genetics. Annu Rev Public Health, 2010. 31: p. 1- 8. 307. Ioannidis, J.P., Commentary: grading the credibility of molecular evidence for complex diseases. Int J Epidemiol, 2006. 35(3): p. 572-8; discussion 593-6. 308. Chen, R., et al., Non-synonymous and synonymous coding SNPs show similar likelihood and effect size of human disease association. PLoS One, 2010. 5(10): p. 
e13574. 309. Nyholt, D.R., A simple correction for multiple testing for single- nucleotide polymorphisms in linkage disequilibrium with each other. Am J Hum Genet, 2004. 74(4): p. 765-9. 310. Bůžková, P., T. Lumley, and K. Rice, Permutation and Parametric Bootstrap Tests for Gene–Gene and Gene–Environment Interactions. Ann Hum Genet, 2011. 75(1): p. 36-45. 311. Wilson, P.W., et al., Prediction of incident diabetes mellitus in middle- aged adults: the Framingham Offspring Study. Arch Intern Med, 2007. 167(10): p. 1068-74.

176 312. Lyssenko, V., et al., Predictors of and Longitudinal Changes in Insulin Sensitivity and Secretion Preceding Onset of Type 2 Diabetes. Diabetes, 2005. 54(1): p. 166-174. 313. Gauderman J. and J. Morrison, QUANTO - a program to compute power for G x E and G x G studies, 2009: Los Angeles. 314. Vineis, P., A self-fulfilling prophecy: are we underestimating the role of the environment in gene-environment interaction research? Int J Epidemiol, 2004. 33(5): p. 945-946. 315. Hayes, M.G., et al., Identification of type 2 diabetes genes in Mexican Americans through genome-wide association studies. Diabetes, 2007. 56(12): p. 3033-44. 316. Ioannidis, J.P., Population-wide generalizability of genome-wide discovered associations. J Natl Cancer Inst, 2009. 101(19): p. 1297-9. 317. Shu, X.O., et al., Identification of new genetic risk variants for type 2 diabetes. PLoS Genet, 2010. 6(9): p. e1001127. 318. Yamauchi, T., et al., A genome-wide association study in the Japanese population identifies susceptibility loci for type 2 diabetes at UBE2E2 and C2CD4A-C2CD4B. Nat Genet, 2010. 42(10): p. 864-8. 319. Tsai, F.J., et al., A genome-wide association study identifies susceptibility variants for type 2 diabetes in Han Chinese. PLoS Genet, 2010. 6(2): p. e1000847. 320. Unoki, H., et al., SNPs in KCNQ1 are associated with susceptibility to type 2 diabetes in East Asian and European populations. Nat Genet, 2008. 40(9): p. 1098-102. 321. Mailman, M.D., et al., The NCBI dbGaP database of genotypes and phenotypes. Nature genetics, 2007. 39(10): p. 1181-6. 322. Environmental Working Group and Commonweal. EWG || Human Toxome Project. [cited 2009 11/11/2009]; Available from: http://www.ewg.org/sites/humantoxome/.

177