
UNIVERSITY OF CALIFORNIA

Los Angeles

Computational Genetic Approaches for Understanding the Genetic Basis of Complex Traits

A dissertation submitted in partial satisfaction of the requirements for the degree Doctor of Philosophy in Computer Science

by

Eun Yong Kang

2013

© Copyright by Eun Yong Kang 2013

ABSTRACT OF THE DISSERTATION

Computational Genetic Approaches for Understanding the Genetic Basis of Complex Traits

by

Eun Yong Kang

Doctor of Philosophy in Computer Science

University of California, Los Angeles, 2013

Professor Eleazar Eskin, Chair

Recent advances in genotyping and sequencing technology have enabled researchers to collect an enormous amount of high-dimensional genotype data. These large-scale genomic data provide an unprecedented opportunity for researchers to study and analyze the genetic factors of human complex traits. One of the major challenges in analyzing these high-throughput genomic data is the need for effective and efficient computational methodologies. In this thesis, I introduce several methodologies for analyzing these genomic data that facilitate our understanding of the genetic basis of complex human traits. First, I introduce a method for inferring biological networks from high-throughput data containing both genetic variation information and gene expression profiles from genetically distinct strains of an organism. For this problem, I use causal inference techniques to infer the presence or absence of causal relationships between yeast gene expression levels in the framework of graphical causal models. In particular, I utilize the prior biological knowledge that genetic variations affect gene expression, but not vice versa, which allows us to direct the subsequent edges between two gene expression levels. Predicting both the presence and the absence of causal relationships between gene expression levels helps distinguish between direct and indirect effects of variation on gene expression levels. I demonstrate the utility of our approach by applying it to a data set containing 112 yeast strains; the proposed method identifies the known “regulatory hotspots” in yeast. Second, I introduce an efficient pairwise identity-by-descent (IBD) association mapping method, which utilizes importance sampling to improve efficiency and enables approximation of extremely small p-values. Two individuals are IBD at a locus if they have identical alleles inherited from a common ancestor. One popular approach to finding the association between IBD status and disease phenotype is the pairwise method, in which one compares the IBD rate of case/case pairs to the background IBD rate to detect excess IBD sharing between cases. One challenge of the pairwise method is computational efficiency. In the pairwise method, one uses permutation to approximate p-values because it is difficult to analytically obtain the asymptotic distribution of the statistic. Since the p-value threshold for genome-wide association studies (GWAS) is necessarily low due to multiple testing, one must perform a large number of permutations, which can be computationally demanding. I present Fast-Pairwise, which overcomes the computational challenges of the traditional pairwise method through importance sampling. Using the WTCCC type 1 diabetes data, I show that Fast-Pairwise can successfully pinpoint a gene known to be associated with the disease within the MHC region. Finally, I introduce a novel meta-analytic approach to identify gene-by-environment interactions by aggregating multiple studies with varying environmental conditions. Identifying environmentally specific genetic effects is a key challenge in understanding the structure of complex traits. Model organisms play a crucial role in the identification of such gene-by-environment interactions, as a result of the unique ability to observe genetically similar individuals across multiple distinct environments. Many model organism studies examine the same traits, but under varying environmental conditions. These studies, when examined in aggregate, provide an opportunity to identify genomic loci exhibiting environmentally dependent effects. In this project, I jointly analyze multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. We show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. I apply our new method to combine 17 mouse studies containing in aggregate 4,965 distinct animals. We identify 26 significant loci involved in high-density lipoprotein (HDL) cholesterol, many of which show significant evidence of involvement in gene-by-environment interactions.

The dissertation of Eun Yong Kang is approved.

Adnan Darwiche

Christopher J. Lee

David Earl Heckerman

Aldons J. Lusis

Eleazar Eskin, Committee Chair

University of California, Los Angeles 2013

To my wife Yun Jin, my parents Jai Soo Kang and Su Hee Kim, and my two sons Aaron and Luke

TABLE OF CONTENTS

1 Introduction

2 Detecting the Presence and Absence of Causal Relationships Between Expression of Yeast Genes with Very Few Samples
2.1 Background
2.2 Methodology
2.2.1 Causal Graphs for Genetical Genomics
2.2.2 Inference Algorithm Overview
2.2.3 Finding potential causal SNPs
2.2.4 Finding causal relationships between genes
2.2.5 Distinguishing between direct and indirect effects of variation
2.2.6 Excluding causal relationships between genes
2.3 Results
2.4 Discussion
2.5 Figures
2.6 Tables

3 Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features
3.1 Introduction
3.1.1 Background
3.1.2 Previous Approaches and Their Limitations
3.1.3 Proposed Approach
3.2 Posterior Computation of Graphical Features
3.3 Previously Proposed Method
3.4 Novel Proposed Method
3.4.1 Novel DP Algorithm for Computing Posteriors of All Adjacency Features
3.4.2 Novel DP Algorithm for Computing Posteriors of All V-Structure Features
3.5 Experiments
3.6 Conclusion

4 Increasing Association Mapping Power and Resolution in Mouse Genetic Studies through the Use of Meta-analysis for Structured Populations
4.1 INTRODUCTION
4.2 MATERIALS AND METHODS
4.2.1 Association Studies
4.2.2 Association Studies in Structured Populations
4.2.3 Meta-Analysis in Structured Populations
4.2.4 Mouse Association Data
4.3 RESULTS
4.3.1 Combining the HMDP with an F2 cross increases power
4.3.2 Meta-analysis leads to an increase in resolution over HMDP and F2 mapping
4.3.3 Application to Bone Mineral Density
4.3.4 Application to HDL cholesterol
4.4 DISCUSSION
4.5 FIGURES

5 Fast Pairwise IBD Association Testing in Genome-wide Association Studies
5.1 Background
5.2 Methods
5.2.1 IBD graph
5.2.2 Pairwise methods for IBD association mapping
5.2.3 Permutation test
5.2.4 Fast-Pairwise
5.2.5 IBD-degreetype
5.2.6 Substitution strategy
5.2.7 Sampling with replacement
5.2.8 Sampling without replacement
5.2.9 P-value calculation
5.2.10 Adjusting for population structure
5.3 Results
5.3.1 Efficiency
5.3.2 Accuracy
5.3.3 Application to WTCCC type 1 diabetes data
5.4 Discussion
5.5 Supplementary Section

6 Meta-analysis identifies Gene-by-environment Interactions as Demonstrated in a Study of 4,965 Mice
6.1 Background
6.2 Results
6.2.1 Discovering environmentally-specific loci using meta-analysis
6.2.2 The connection between random effects meta-analysis and gene-by-environment interactions
6.2.3 Gene-by-environment interactions are prevalent in mouse association studies
6.2.4 Power of meta-analysis for detecting gene-by-environment interactions
6.2.5 Application to 17 mouse HDL studies
6.3 Discussion
6.4 Methods and Materials
6.4.1 Standard study design for testing gene-by-environment interactions
6.4.2 Multivariate interactions model
6.4.3 Standard meta-analysis approach
6.4.4 Relation between gene-by-environment interactions and meta-analysis
6.4.5 Controlling for population structure within studies
6.4.6 Meta-analysis of studies with population structure
6.4.7 Identifying studies with an effect
6.5 Tables
6.6 Figures
6.7 Supplementary Legends
6.8 Supplementary Materials
6.8.1 Power comparison between Meta-GxE, fixed effects meta-analysis and Heterogeneity testing approach
6.8.2 Power and type I error comparison between Meta-GxE approach and traditional Wald test based GxE testing approach
6.8.3 Details of 17 HDL mouse studies

7 Genome Wide Association Study for Age-related Hearing Loss in the Mouse: A Meta-analysis
7.1 Introduction
7.2 Results
7.2.1 New loci discovered through random-effects meta-analysis
7.2.2 Interpreting meta-analysis using posterior probabilities
7.2.3 Novel loci identified in meta-analysis
7.3 Discussion
7.4 Methods and Materials
7.4.1 Controlling for population structure within studies
7.5 Supplementary Materials
7.5.1 Mouse hearing association results
7.5.2 PM-plots for significant locus

8 A Novel Meta-analysis Approach for Genome-wide Association Studies with Sex-specific Effects
8.1 Background
8.2 Results
8.2.1 Applying meta-analysis to handle sex-specific effects
8.2.2 The presence of background GxS interactions explains the advantage of the sex-differentiated meta-analysis approach
8.2.3 Statistical power comparison between FE, UFE, GWAMA, WGWAMA, SST and RE2 approaches for Sex-differentiated meta-analysis
8.3 Discussion
8.4 Methods
8.4.1 Fixed effects model meta analysis for sex-differentiated analysis
8.4.2 Unweighted fixed effects model meta analysis for sex-differentiated analysis
8.4.3 SST approach for sex-differentiated analysis
8.4.4 GWAMA approach for sex-differentiated analysis
8.4.5 Weighted GWAMA approach for sex-differentiated analysis
8.4.6 Meta-sex approach for sex-differentiated analysis
8.4.7 Controlling for population structure within studies
8.4.8 Computation of statistical power of the methods
8.4.9 Simulating GxS interactions
8.4.10 Computing heritability of GxS interactions
8.5 Supplementary Materials

9 Conclusion

References

LIST OF FIGURES

2.1 Complete causal network in yeast with the nine regulatory hotspots colored. Circles designate regulators, squares designate targets and diamonds designate genes that are both regulators and targets. The spring-embedded view of the causal network shows that some hotspots, namely hotspots 2 (red), 3 (bright green), and 9 (light blue), overlap well with the hub-like structures of the network where regulators are positioned in the middle and targets surround the causal hub.

2.2 Complete causal network in yeast with the nine regulatory hotspots colored. Circles designate regulators, squares designate targets and diamonds designate genes that are both regulators and targets. Causal network grouped by hotspot shows that some regulators and targets (indicated by gray) are not part of known regulatory hotspots.

2.3 Possible causal graphs relating a triplet considering a SNP S with the level of gene expression for genes E_i and E_j. Bidirected edges denote hidden common causes. (a) Nine possible causal models consistent with S being a causal ancestor of E_i and E_j (models H_4 through H_9 are indistinguishable from observations of the triplet). (b) Four possible causal models consistent with S being a causal ancestor of E_i while being uncorrelated with E_j.

3.1 Three graphical models from a Markov equivalence class

3.2 (a) Directed edge approach (b) Adjacency + v-structure approach (c) Maximum likelihood approach

4.1 Mapping power is increased by combining populations using meta-analysis. We performed simulations assuming a background genetic effect of 25%. Power was calculated as the percentage of associations detected at a given level of significance for SNPs simulated to be causal. The meta-analysis method is shown to provide increased power at all effect sizes.

4.2 The combined mapping result has higher resolution than that of the HMDP or F2 mapping. The distribution of distances from the true causal SNP to the most significant association is shown in units of megabases. We considered peak associations which are within 15 Mb of the true causal SNP. As expected, the HMDP has much higher resolution than the F2 cross. The combined result achieves an even higher resolution.

4.3 Meta-analysis results in increased significance and increased resolution for two loci known to be associated with BMD. Two loci, one on chr 4 and one on chr 7, were previously found to be associated with BMD (Bmd7 and Bmd41 respectively) [FNG09]. After applying meta-analysis, we found that the peak associated SNPs for both of these loci had increased significance with respect to the F2 and HMDP mapping panels. We also found that using significant SNPs to define intervals within these loci resulted in tighter confidence intervals, indicating increased resolution.

4.4 Meta-analysis increases significance of known association. We compare the association mapping results obtained from the HMDP, F2 cross and meta-analysis for HDL cholesterol on chromosome 1. We have previously shown that the Apoa2 gene underlies the chromosome 1 HDL locus [WHQ93].

6.1 Application of Meta-GxE to Apoa2 locus. The forest plot (A) shows heterogeneity in the effect sizes across different studies. The PM-plot (B) predicts that 7 studies have an effect at this locus, even though only 1 study (HMDP-chow(M)) is genome-wide significant with P-value.

6.2 The prevalence of heterogeneity in effect size of significant loci. Each dot represents the association between a SNP and the HDL phenotype obtained by applying the random effects based meta-analysis approach. Dots with larger I² values represent the existence of more heterogeneity at the locus between studies. The distribution of the heterogeneity statistic for significant SNPs (red dots) in the meta-analysis is skewed toward higher heterogeneity while the non-significant SNPs are much less skewed.

6.3 Power of mouse meta-analysis to identify gene-by-environment interactions in 4,965 animals from 17 studies under varying mean effect sizes and the per study variance of the effect size which corresponds to gene-by-environment effects.

6.4 Power of (a) random-effect, (b) fixed-effect meta-analysis and (c) heterogeneity meta-analysis methods as a function of the effect size and the strength of the interaction effect (heterogeneity). (d) shows a comparison of the three methods with the color corresponding to the method with the highest power.

6.5 Manhattan plots showing the results of Meta-GxE applied to (a) 17 HDL studies, (b) 9 HDL studies consisting only of male animals and (c) 8 studies consisting only of female animals.

6.6 Peak SNP in chromosome 8 shows interesting gene-by-environment interactions between sex × mutation-driven LDL levels.

6.7 Power comparison between random-effect, fixed-effect meta-analysis and heterogeneity testing

6.8 The association result of the fixed effects meta-analysis.

6.9 Power comparison between random-effect meta-analysis and traditional Wald test based approach

6.10 Power comparison between heterogeneity testing approach and traditional Wald test based approach

6.11 Forest plot and PM-plot for Chr1:64752822 locus

6.12 Forest plot and PM-plot for Chr1:107271282 locus

6.13 Forest plot and PM-plot for Chr1:171199523 locus

6.14 Forest plot and PM-plot for Chr2:77837584 locus

6.15 Forest plot and PM-plot for Chr2:134421733 locus

6.16 Forest plot and PM-plot for Chr3:32944259 locus

6.17 Forest plot and PM-plot for Chr3:76066632 locus

6.18 Forest plot and PM-plot for Chr3:107430396 locus

6.19 Forest plot and PM-plot for Chr3:143466942 locus

6.20 Forest plot and PM-plot for Chr4:131925523 locus

6.21 Forest plot and PM-plot for Chr5:119034507 locus

6.22 Forest plot and PM-plot for Chr8:46903188 locus

6.23 Forest plot and PM-plot for Chr8:64150094 locus

6.24 Forest plot and PM-plot for Chr8:84073148 locus

6.25 Forest plot and PM-plot for Chr9:101972687 locus

6.26 Forest plot and PM-plot for Chr10:21399819 locus

6.27 Forest plot and PM-plot for Chr10:90146088 locus

6.28 Forest plot and PM-plot for Chr11:69906552 locus

6.29 Forest plot and PM-plot for Chr11:114083173 locus

6.30 Forest plot and PM-plot for Chr14:33632464 locus

6.31 Forest plot and PM-plot for Chr15:21194226 locus

6.32 Forest plot and PM-plot for Chr15:59860191 locus

6.33 Forest plot and PM-plot for Chr17:46530712 locus

6.34 Forest plot and PM-plot for Chr18:82240606 locus

6.35 Forest plot and PM-plot for Chr19:3319089 locus

6.36 Forest plot and PM-plot for ChrX:151384614 locus

7.1 Mouse hearing association results from random effect meta-analysis

7.2 (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (8 kHz)

7.3 (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (16 kHz)

7.4 (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (32 kHz)

7.5 The association result of three studies for 8 kHz mouse hearing.

7.6 The association result of three studies for 16 kHz mouse hearing.

7.7 The association result of three studies for 32 kHz mouse hearing.

7.8 (A) Forest plot and (B) PM-plot for Chr1:196270162 locus (8 kHz)

7.9 (A) Forest plot and (B) PM-plot for Chr8:130247700 locus (8 kHz)

7.10 (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (8 kHz)

7.11 (A) Forest plot and (B) PM-plot for Chr1:196270162 locus (16 kHz)

7.12 (A) Forest plot and (B) PM-plot for Chr8:130247700 locus (16 kHz)

7.13 (A) Forest plot and (B) PM-plot for Chr9:30287464 locus (16 kHz)

7.14 (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (16 kHz)

7.15 (A) Forest plot and (B) PM-plot for Chr1:196345862 locus (32 kHz)

7.16 (A) Forest plot and (B) PM-plot for Chr8:131366159 locus (32 kHz)

7.17 (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (32 kHz)

7.18 (A) Forest plot and (B) PM-plot for Chr18:24320393 locus (32 kHz)

8.1 Power of (a) random-effect meta-analysis (RE2), (b) fixed-effect meta-analysis (FE) and (c) sex-as-covariate methods (CV) as a function of the GxS heritability and the heterogeneity between the male and the female study effect size. (d) shows a comparison of the three methods with the color corresponding to the method with the highest power.

8.2 Statistical power comparison between FE, GWAMA, SST and RE2 approaches while varying the effect sizes of the male and female studies. Power is compared (a) under the assumption of a balanced number of individuals and (b) when there are twice as many females as males. The diagonal line shows the points where the effect sizes between males and females are equal.

8.3 Statistical power comparison between FE, UFE, WGWAMA, GWAMA, and SST approaches while varying the effect sizes of the male and female studies. Power is compared (a) under the assumption of a balanced number of individuals and (b) when there are twice as many females as males. The diagonal line shows the points where the effect sizes between males and females are equal.

S1 QQ plots for BMI phenotype

S2 QQ plots for Crp phenotype

S3 QQ plots for Dia phenotype

S4 QQ plots for Glu phenotype

S5 QQ plots for Hdl phenotype

S6 QQ plots for height phenotype

S7 QQ plots for ins phenotype

S8 QQ plots for ldl phenotype

S9 QQ plots for sys phenotype

S10 QQ plots for tg phenotype

LIST OF TABLES

2.1 Summary Statistics For Different Likelihood Thresholds

2.2 Regulatory Hotspots and Corresponding Regulators

2.3 Significantly Enriched Processes For Causal and Not Causal Genes

5.1 Running time for pairwise IBD association testing for the WTCCC whole genome data

6.1 17 HDL studies for meta-analysis

6.2 26 significant loci identified by applying Meta-GxE analysis

6.3 False Positive Rate of RE versus Traditional Wald Test Based Approaches at Thresholds of Increasing Significance

6.4 False Positive Rate of HE versus Traditional Wald Test Based Approaches at Thresholds of Increasing Significance

7.1 (a) Three significant loci were identified in the meta-analysis for the 8 kHz tone burst. (b) Four significant loci were identified in the meta-analysis for the 16 kHz tone burst. (c) Three significant loci were identified in the meta-analysis for the 8 kHz tone burst. E denotes the number of studies with an effect on the 8 kHz phenotype. N denotes the number of studies without an effect. A denotes the number of studies with an ambiguous effect.

7.2 Genetic factors that contribute to age-related hearing loss in inbred mouse strains. CS=chromosome substitution; RI=recombinant inbred.

8.1 The comparison of genomic control λ for 5 approaches across 10 NFBC phenotypes.

8.2 Comparison of 4 association mapping approaches: Fixed effect meta-analysis (FE), random effect meta-analysis (RE2), sex-as-covariate (Covariate), combined approach (Combined)

ACKNOWLEDGMENTS

First and foremost, I thank my advisor, Eleazar Eskin, for his support and guidance in my research. He always trusted me and encouraged me, even when I was depressed, so that I could continue to focus on my research with enthusiasm. It is amazing that he was always available when I needed his help. In the initial stages of my research, I found it difficult to familiarize myself with genetics because I did not have a strong background in the subject. Through experiencing a wide variety of genetics projects, I learned valuable lessons and now feel that there are so many interesting and exciting areas to study and discover in the field of genetics. I also had a lot of fun with Eleazar, both when carrying out research and outside of the laboratory.

I would like to thank the rest of my committee: Adnan Darwiche, Christopher Lee, David Heckerman, and Aldons Jake Lusis. I especially thank David Heckerman for his mentorship and for allowing me the opportunity to work on FastLmm projects. I also had lots of fun when I was working at Microsoft Research. I especially thank Jake Lusis for his support, great guidance and mentorship during my research. I had a great opportunity to work with Jake and his lab members on several very interesting projects; this has been one of the greatest benefits to the development of my research career.

I would like to thank all of my current and previous colleagues in Zarlab. I especially thank Hyun Min Kang for his mentorship and guidance in my research. Most of all, Hyun introduced me to Eleazar, giving me the opportunity to join the Zarlab. I met Hyun on the bus on the way home from UCLA. Although I barely knew him, he shared his research and suggested that I join Zarlab. I thank Chun (Jimmie) Ye for teaching me statistics and several useful R commands during the first year of my PhD in Zarlab. I especially thank Ilya Shpitser for teaching me graphical modeling theory and causality. I could always feel his enthusiasm for research while I was working with him. I especially thank Buhm Han for his insightful suggestions, for sharing scientific insights and for being a good mentor for my research. I especially thank Richard Davis for his efforts in the Meta-GxE project. We spent lots of time together interpreting the Meta-GxE results and he taught me much interesting biological knowledge behind the 17 mouse HDL studies. I also thank the rest of my current and previous lab colleagues Noah Zaitlen, Sean O’Rourke, Emrah Kostem, Chris Jones, Olivera Grujic, Nick Furlotte, Jae-hoon Sul, Dan He, Warin Isvilanonda, Serghei Mangul, Farhad Hormozidiari, Jong Wha (Joanne) Joo, Wen-Yun Yang, and Zhanyong Wang, for sharing scientific insights and our enjoyable time together. I especially thank Christoph Lippert from Microsoft Research for teaching me mixed models and useful linear algebra. I also thank all the rest of my colleagues in Microsoft Research: Jennifer Listgarten, Carl M Kadie, and Robert I. Davidson. I had a lot of fun when I was working at Microsoft Research. I also thank all of my great collaborators: Sagiv Shiffman, Eyal Ben-David, Rick A. Friedman, Jeffry Ohmen, Lisa Martin, Soumya Raychaudhuri, Paul I. W. de Bakker, Atila Van Nas, Elin Org, Thomas Drake, and Charles Farber.

My church family in Olympic ward always encourages and inspires me, both temporally and spiritually. I especially thank my parents, Jai Soo Kang and Su Hee Kim, my sister, Eunchai Kang, and my in-laws for their endless love and support. I would like to thank my wife, Yun Jin Kang, and two sons, Aaron and Luke, for their presence, love, encouragement, comfort and support. I especially thank my wife, Yun Jin Kang, for waiting so long for my graduation. Without her help I would never be in this state: happy, healthy, full of gratitude, and filled with love. Finally and most importantly, I would like to thank my Heavenly Father and his son Jesus Christ, my Lord, Savior and Redeemer, for his everlasting love, grace, mercy and guidance in my life.

VITA

2005 Bachelor of Science in Computer Science, University of Utah, Salt Lake City, Utah

2002-2004 Teaching Assistant, University of Utah, Salt Lake City, Utah

2003-2004 Software Engineer, Global Knowledge Management Center, University of Utah, Salt Lake City, Utah

2004-2007 Research Assistant, University of Utah, Salt Lake City, Utah

2007 Master of Science in Computer Science, University of Utah, Salt Lake City, Utah

2008-2009 Teaching Assistant, University of California Los Angeles, Los Angeles, California

2012 Research Intern, Microsoft Research, Los Angeles, California

2008-2013 Research Assistant, University of California Los Angeles, Los Angeles, California

PUBLICATIONS

Eun Yong Kang*, Buhm Han*, Nicholas Furlotte*, Jong Wha Joo, Diana Shih, Richard Davis, Jake Lusis, Eleazar Eskin, “Meta-analysis identifies gene-by-environment interactions as demonstrated in a study of 4,323 mouse samples”, in PLOS Genetics, 2013.

Buhm Han*, Eun Yong Kang*, Soumya Raychaudhuri, Paul I. W. de Bakker and Eleazar Eskin, “Fast Pairwise IBD Association Testing in Genome-wide Association Studies”, in , 2013.

Christoph Lippert, Gerald Quon, Eun Yong Kang, Carl M Kadie, Jennifer Listgarten and David Heckerman, “The benefits of selecting phenotype-specific variants for applications of mixed models in genomics”, in Scientific Reports, April 2013.

Jennifer Listgarten*, Christoph Lippert*, Eun Yong Kang, Jing Xiang, Carl M Kadie and David Heckerman, “A powerful and efficient set test for genetic markers that handles confounders”, in Bioinformatics, April 2013.

Anatole Ghazalpour, Christoph D Rau, Charles R Farber, Brian J Bennett, Luz D Orozco, Atila van Nas, Calvin Pan, Hooman Allayee, Simon W Beaven, Mete Civelek, Richard C Davis, Thomas A Drake, Rick A Friedman, Nick Furlotte, Simon T Hui, J David Jentsch, Emrah Kostem, Hyun Min Kang, Eun Yong Kang, Jong Wha Joo, Vyacheslav A Korshunov, Rick E Laughlin, Lisa J Martin, Jeffrey D Ohmen, Brian W Parks, Matteo Pellegrini, Karen Reue, Desmond J Smith, Sotirios Tetradis, Jessica Wang, Yibin Wang, James N Weiss, Todd Kirchgessner, Peter S Gargalovic, Eleazar Eskin, Aldons J Lusis, Renée C Leboeuf, “Hybrid mouse diversity panel: a panel of inbred mouse strains suitable for analysis of complex genetic traits”, in Mammalian Genome, October 2012.

Nicholas A. Furlotte*, Eun Yong Kang*, Atila Van Nas, Charles R. Farber, Aldons J. Lusis and Eleazar Eskin, “Increasing association power and resolution in mouse genetic studies through the use of meta-analysis for structured populations”, in Genetics, April 2012.

Andrew Kirby, Hyun Min Kang, Claire M. Wade, Chris J. Cotsapas, Emrah Kostem, Buhm Han, Nick Furlotte, Eun Yong Kang, Manuel Rivas, Molly A. Bogue, Kelly A. Frazer, Frank M. Johnson, Erica J. Beliharz, David R. Cox, Eleazar Eskin, Mark J. Daly, “Fine Mapping in 94 Inbred Mouse Strains Using a High-density Haplotype Resource”, in Genetics, July 2010.

Eun Yong Kang, Chun Ye, Ilya Shpitser and Eleazar Eskin, “Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples”, in Journal of Computational Biology, March, 2010.

Eun Yong Kang, Ilya Shpitser and Eleazar Eskin, “Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features”, in Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI2010), Atlanta, GA: July 11-15, 2010.

Eun Yong Kang, Ilya Shpitser, Chun Ye and Eleazar Eskin, “Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples”, in Proceedings of the Thirteenth Annual Conference on Research in Computational Molecular Biology (RECOMB2009), Tucson, Arizona: May 18-21, 2009.

Eun Yong Kang, Hyun Min Kang, Ilya Shpitser, Chun Ye and Eleazar Eskin, “Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples”, in Proceedings of NIPS 2008 Workshop on Machine Learning in Computational Biology (NIPS2008), Whistler, Canada: December 12th, 2008.

Hoa Nguyen, Eun Yong Kang and Juliana Freire, “Automatically Extracting Form Labels”, in Proceedings of the IEEE 24th International Conference on Data Engineering (ICDE2008), April 7-12, 2008.

CHAPTER 1

Introduction

The human genome contains the complete set of genetic information for humans. This information is encoded by DNA sequences within 23 chromosomes comprising roughly 3 billion nucleotides. Of these 3 billion nucleotides, only a small fraction differ between individuals. The most common type of individual DNA variation is the single nucleotide polymorphism (SNP), a variation in a single nucleotide at a particular genomic position, mainly due to a single point mutation. If a particular genetic variant has a functional property that affects a phenotype, such as susceptibility to a disease, it is referred to as causal with respect to that phenotype. The human genetics research community has made a huge effort to discover the causal genetic variants of many phenotypes. Variants related to human disease are particularly important and, unfortunately, remain among the most elusive to discover.

One major driver for the elucidation of the genetic basis of human complex diseases is the advancement in genotyping and sequencing technology. These technological advances have enabled researchers to collect an enormous amount of high-throughput genotype data. Large scale international research efforts, such as the HapMap Project and the 1000 Genomes Project, have characterized a detailed catalogue of genetic variation between individuals from a number of different ethnic groups. With the completion of these human genome projects, it is possible to gain significant insights into the nature of human genetic diversity. This diversity includes changes to individual nucleotides (SNPs), changes in the number of copies of a segment of DNA (copy number variations; CNVs), and other structural changes such as inversions and translocations.

The challenge in discovering the genetic basis of human complex traits, including disease, is that the process is inherently computationally demanding due to the high dimensionality of human genetic data. To tackle the computational challenges of analyzing these large scale data, efficient and effective computational genetic approaches are necessary. Computational genetics refers to the field of study that utilizes computational and statistical analyses to understand causal biological mechanisms by analyzing large scale genetic data such as whole genome sequencing data, exome sequencing data, genotyping data, gene expression arrays, or transcriptome sequencing data. Analyzing each of these data types to uncover important biological mechanisms requires an understanding of the fundamental characteristics of human genetic data, gained through human genome projects, together with effective and efficient computational methodologies, each of which is tailored to a particular goal of interest.

One popular computational genetic approach to analyzing large scale human genetic data is the genome-wide association study (GWAS). Genome-wide association studies systematically search for disease susceptibility loci by computing the correlation between common genetic variants and phenotypes such as disease status or various quantitative traits. The method searches the genome for genetic variations such as SNPs that occur more frequently in people with a particular disease than in people without the disease. The fundamental assumption of genome-wide association studies is that SNPs typically show a local correlation structure within a genomic segment. The high correlation between physically close SNPs is due to linkage disequilibrium (LD): nearby loci are rarely separated by recombination and therefore tend to be inherited together. In recent years, high-density genome-wide association (GWA) studies have revealed thousands of disease-associated loci that contribute to complex human diseases, and these findings provide valuable clues to the allelic architecture of complex traits.
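As a minimal illustration of the per-SNP test that an association study repeats across the genome, the sketch below computes an allelic chi-square statistic from a 2x2 table of allele counts in cases and controls. The function name and the counts are hypothetical, and practical GWAS pipelines typically use regression models with covariates rather than this bare test.

```python
import math


def allelic_chi_square(case_alt, case_ref, ctrl_alt, ctrl_ref):
    """Return (chi2, p_value) for a 2x2 allele-count table (1 degree of freedom)."""
    table = [[case_alt, case_ref], [ctrl_alt, ctrl_ref]]
    total = case_alt + case_ref + ctrl_alt + ctrl_ref
    row_sums = [sum(row) for row in table]
    col_sums = [case_alt + ctrl_alt, case_ref + ctrl_ref]
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            expected = row_sums[i] * col_sums[j] / total
            chi2 += (table[i][j] - expected) ** 2 / expected
    # Upper tail of a chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2.0))
    return chi2, p_value


# Hypothetical counts: the alternate allele is seen 60 times among 100 case
# alleles and 40 times among 100 control alleles.
chi2, p_value = allelic_chi_square(60, 40, 40, 60)
print(chi2, p_value)
```

Because this test is repeated for every SNP, the significance threshold must be corrected for multiple testing, which is why genome-wide thresholds are so low.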

Although individual GWAS have identified many disease-associated loci, the identified loci explain only part of the variance in the phenotype of interest. This inability to explain the full variance of the phenotype is termed the “missing heritability” problem. One possible reason for this phenomenon is that a GWAS performed on a single cohort lacks the power to identify all of the disease-associated loci, particularly genetic variants with a small effect size or low minor allele frequency. The popular approach to address the lack of statistical power of a single GWA study is meta-analysis. In the meta-analysis approach, summary statistics from many independent studies, such as the effect size estimate and its standard error from each study, are combined. This also helps to overcome limitations on sharing individual level data. Additionally, the combined information from a large number of independent studies leads to increased statistical power and better control of false positives. Thus meta-analysis has become the most popular approach for association studies and has led to the discovery of new disease-associated genetic loci.
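To make the combination of summary statistics concrete, the following sketch implements the standard inverse-variance weighted fixed effects meta-analysis. The function name and the example numbers are hypothetical, and this is the textbook estimator rather than any of the specific methods developed in later chapters.

```python
import math


def fixed_effects_meta(betas, ses):
    """Combine per-study effect sizes and standard errors by inverse-variance weighting.

    Returns (combined_beta, combined_se, z_score, two_sided_p_value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    total_weight = sum(weights)
    beta = sum(w * b for w, b in zip(weights, betas)) / total_weight
    se = math.sqrt(1.0 / total_weight)
    z = beta / se
    # Two-sided p-value under a standard normal null distribution.
    p_value = math.erfc(abs(z) / math.sqrt(2.0))
    return beta, se, z, p_value


# Hypothetical summary statistics from three independent studies of one SNP.
beta, se, z, p_value = fixed_effects_meta([0.10, 0.14, 0.08], [0.05, 0.07, 0.04])
print(beta, se, z, p_value)
```

The combined standard error is smaller than that of any single study, which is the source of the gain in statistical power.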

In this thesis, I introduce several statistical and computational approaches for efficient and effective genetic studies. Here I give a brief background for each problem and provide an intuitive explanation of the contribution.

In Chapter 2, I present a method to infer biological networks from high-throughput data containing both genetic variation information and gene expression profiles from a set of genetically distinct individuals. Inference of biological networks from high-throughput genomic data is an important yet very challenging problem in genetical genomics and systems biology. Gene expression and genetic variation data from high-throughput technologies provide an unprecedented opportunity to infer the underlying biological networks that govern how individual genetic variation mediates gene expression and how genes regulate and interact with each other. The method presented in this chapter tries to infer much richer information about the network between gene expressions than a coexpression network, which provides only correlations between gene expressions. In particular, I infer the presence or absence of causal relationships between yeast gene expressions in the framework of graphical causal models. In addition, our method can detect the absence of causal relationships and can distinguish between direct and indirect effects of variation on a gene expression level. I evaluate our method using a well studied dataset consisting of both genetic variations and gene expressions collected over randomly segregated yeast strains. Our predictions of causal regulators, genes that control the expression of a large number of target genes, are consistent with previously known experimental evidence.

In Chapter 3, I present an efficient dynamic programming algorithm for computing posterior probabilities of causal graphical features such as directed edges and v-structures. There have been many efforts to identify causal graphical features such as directed edges between random variables from observational data. Tian et al. [TH09] proposed a dynamic programming algorithm which computes marginalized posterior probabilities of directed edge features over all the possible structures in $O(n3^n)$ time when the number of parents per node is bounded by a constant, where $n$ is the number of variables of interest. However, the main drawback of this approach is that deciding a single appropriate threshold for the existence of the directed edge feature is difficult, due to the scale difference of the posterior probabilities between the directed edges forming v-structures and the directed edges not forming v-structures. I show that computing posterior probabilities of both adjacencies and v-structures is necessary and more effective for discovering causal graphical features, since it allows us to find a single appropriate decision threshold for the existence of the feature that is tested. For efficient computation, I devise a novel dynamic programming algorithm which computes the posterior probabilities of all of the $\frac{n(n-1)}{2}$ adjacency and $n\binom{n-1}{2}$ v-structure features in $O(n^3 3^n)$ time.
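The two feature counts can be checked by brute-force enumeration for small n. The sketch below is only a counting aid with hypothetical function names; it enumerates candidate features and does not compute any posterior probabilities.

```python
from itertools import combinations
from math import comb


def feature_counts(n):
    """Closed-form counts of candidate adjacency and v-structure features."""
    return n * (n - 1) // 2, n * comb(n - 1, 2)


def enumerate_features(n):
    """List every candidate adjacency {a, b} and v-structure a -> child <- b."""
    nodes = range(n)
    adjacencies = list(combinations(nodes, 2))
    v_structures = [
        (a, child, b)
        for child in nodes
        for a, b in combinations([v for v in nodes if v != child], 2)
    ]
    return adjacencies, v_structures


for n in range(3, 7):
    adjacencies, v_structures = enumerate_features(n)
    print(n, len(adjacencies), len(v_structures), feature_counts(n))
```

Each v-structure is determined by choosing the common child (n ways) and an unordered pair of parents among the remaining n - 1 variables, which gives the second count.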

In Chapter 4, I introduce a meta-analysis approach to combine multiple studies, where each study contains population structure. In meta-analysis, the results of separate studies are combined to obtain an aggregate result. This type of analysis has been popular in human genome-wide association studies due to issues of data privacy and the potential for reduced computational burden and increased statistical power [BFJ08]. However, one problem that has not been effectively addressed is how to combine multiple studies using meta-analysis when each study has some degree of population structure, a well-known problem in association studies in single populations [KZW08]. For single populations, there are effective algorithms and procedures for dealing with issues related to population structure [KZW08, LLL11, ZSZ12], but it is not clear how these methods may be adapted to meta-analysis. Motivated by a specific problem in mouse genetics, I introduce a method for combining separate study populations when each study contains population structure. The method works by correcting each study separately and then combining the studies while considering the degree of population structure in each. I show that this method results in increased power and increased association resolution when combining two separate mouse populations.

In Chapter 5, I introduce an efficient pairwise identity-by-descent (IBD) association mapping method, which utilizes importance sampling to improve efficiency and enable approximation of extremely small p-values. Two individuals are IBD at a locus if they have identical alleles inherited from a common ancestor. Recently, investigators have proposed state-of-the-art IBD detection methods to detect IBD segments between purportedly unrelated individuals. The IBD information can then be used for association testing in genetic association studies. One popular approach to find the association between IBD status and disease phenotype is the pairwise method, where one compares the IBD rate of case/case pairs to the background IBD rate to detect excessive IBD sharing between cases. One challenge of the pairwise method is computational efficiency. In the pairwise method, one uses permutation to approximate p-values because it is difficult to analytically obtain the asymptotic distribution of the statistic. Since the p-value threshold for genome-wide association studies (GWAS) is necessarily low due to multiple testing, one must perform a large number of permutations, which can be computationally demanding. I present Fast-Pairwise to overcome the computational challenges of the traditional pairwise method by utilizing importance sampling to improve efficiency and enable approximation of extremely small p-values. Fast-Pairwise takes only days to complete a genome-wide scan. In the application to the WTCCC type 1 diabetes data, Fast-Pairwise successfully fine-maps an HLA gene which is known to cause the disease.
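The computational burden of the traditional pairwise method can be seen in a small sketch of the plain permutation test at a single locus. This is an illustration under simplifying assumptions and not the Fast-Pairwise importance sampling procedure: the function names and the toy data are hypothetical, and IBD status is assumed to be given as a set of unordered pairs of individual indices.

```python
import itertools
import random


def case_pair_ibd_rate(ibd_pairs, is_case):
    """Fraction of case/case pairs that are IBD at the locus."""
    cases = [i for i, flag in enumerate(is_case) if flag]
    if len(cases) < 2:
        raise ValueError("need at least two cases")
    n_pairs = len(cases) * (len(cases) - 1) // 2
    shared = sum(
        1 for a, b in itertools.combinations(cases, 2)
        if frozenset((a, b)) in ibd_pairs
    )
    return shared / n_pairs


def permutation_p_value(ibd_pairs, is_case, n_perm=1000, seed=0):
    """One-sided permutation p-value for excess IBD sharing among cases.

    The background IBD rate over all pairs does not change when labels are
    permuted, so comparing case/case rates is equivalent to comparing the
    difference between the case/case rate and the background rate.
    """
    rng = random.Random(seed)
    observed = case_pair_ibd_rate(ibd_pairs, is_case)
    labels = list(is_case)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(labels)
        if case_pair_ibd_rate(ibd_pairs, labels) >= observed:
            hits += 1
    # The smallest attainable p-value is 1 / (n_perm + 1).
    return observed, (hits + 1) / (n_perm + 1)


# Toy data: 8 individuals, the first 4 are cases, and all case pairs are IBD.
toy_ibd = {frozenset(pair) for pair in itertools.combinations(range(4), 2)}
toy_labels = [True] * 4 + [False] * 4
print(permutation_p_value(toy_ibd, toy_labels))
```

Since the estimate can never fall below 1 / (n_perm + 1), resolving a p-value near a genome-wide threshold requires a correspondingly large number of permutations at every locus, which motivates a more efficient sampling scheme.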

In Chapter 6, I introduce a novel meta-analytic approach to identify gene-by-environment interactions by aggregating multiple studies with varying environmental conditions. Identifying environmentally specific genetic effects is a key challenge in understanding the structure of complex traits. Model organisms play a crucial role in the identification of such gene-by-environment interactions, as a result of the unique ability to observe genetically similar individuals across multiple distinct environments. Many model organism studies examine the same traits but under varying environmental conditions. For example, in mouse genetic studies of cholesterol levels, knock-out or diet-controlled mice are often utilized. The availability of these studies presents a unique opportunity to identify genomic loci involved in gene-by-environment interactions as well as those loci involved in the trait independent of the environment. In this project, I jointly analyze multiple studies with varying environmental conditions using a meta-analytic approach based on a random effects model to identify loci involved in gene-by-environment interactions. Our approach is motivated by the observation that methods for discovering gene-by-environment interactions are closely related to random effects models for meta-analysis. I show that interactions can be interpreted as heterogeneity and can be detected without utilizing the traditional uni- or multi-variate approaches for discovery of gene-by-environment interactions. I also show that our approach is more powerful than the traditional uni- or multi-variate gene-by-environment association approach, which assumes knowledge of the covariates involved in gene-by-environment interactions. Because it models environmental variables explicitly, the traditional approach requires prior knowledge about environmental factors, including the kinds of variables (e.g. sex, age, gene knockouts) and the encoding of the variables (e.g. binary or continuous values). The proposed approach does not require such knowledge to identify gene-by-environment interactions. I apply our new method to combine 17 mouse studies containing in aggregate 4,965 distinct animals. I identify 26 significant loci involved in high-density lipoprotein (HDL) cholesterol, many of which show significant evidence of involvement in gene-by-environment interactions.
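The idea that an interaction shows up as heterogeneity of effect sizes across studies can be illustrated with two classical quantities: Cochran's Q statistic and the DerSimonian-Laird estimate of the between-study variance. The sketch below uses these textbook formulas as a stand-in and is not necessarily the random effects model used in Chapter 6; the function names and numbers are hypothetical.

```python
def cochran_q(betas, ses):
    """Return (Q, fixed_effects_beta) for per-study effects and standard errors."""
    weights = [1.0 / se ** 2 for se in ses]
    beta_fe = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    q = sum(w * (b - beta_fe) ** 2 for w, b in zip(weights, betas))
    return q, beta_fe


def dersimonian_laird_tau2(betas, ses):
    """Method-of-moments estimate of the between-study variance tau^2."""
    weights = [1.0 / se ** 2 for se in ses]
    q, _ = cochran_q(betas, ses)
    k = len(betas)
    c = sum(weights) - sum(w ** 2 for w in weights) / sum(weights)
    # Under homogeneity Q is about k - 1, so the estimate is truncated at zero.
    return max(0.0, (q - (k - 1)) / c)


# Hypothetical locus whose effect changes sign between environments.
print(cochran_q([0.5, -0.5, 0.0], [0.1, 0.1, 0.1]))
print(dersimonian_laird_tau2([0.5, -0.5, 0.0], [0.1, 0.1, 0.1]))
```

A locus with the same effect in every environment gives Q near its null expectation and an estimated between-study variance of zero, while an environment-dependent effect inflates both, without the environmental variable ever being encoded explicitly.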

In Chapter 7, I analyze mouse age-related hearing loss data with a meta-analytic approach. Age-related hearing impairment (ARHI) is characterized by a symmetric sensorineural hearing loss primarily in the high frequencies, and individuals have different levels of susceptibility to ARHI. Heritability studies have shown that the sources of this variance are both genetic and environmental, with approximately half of the variance attributable to hereditary factors [HT10]. Only a limited number of large-scale association studies for ARHI have been undertaken in humans to date. An alternate and complementary approach to these human studies is through the use of mouse models. Advantages of mouse models include that the environment can be more carefully controlled, measurements can be replicated in genetically identical animals, and the proportion of the variability explained by genetic variation is increased. Complex traits in mouse strains have been shown to have higher heritability, and genetic loci often have stronger effects on the trait compared to humans. Motivated by these advantages, we have performed the first genome-wide association study of its kind in the mouse by combining several data sets in a meta-analysis to identify loci associated with age-related hearing loss. I identified five genome-wide significant loci ($p < 10^{-8}$). One of these loci generated a narrow peak in the region of Ahl8. These data confirm the utility of this approach and provide new high-resolution mapping information about variation within the mouse genome associated with hearing loss.

In Chapter 8, I introduce a powerful random effects meta-analytic approach for genome-wide association studies with sex-specific effects. The prevalence of sex-specific effects complicates the analysis of association studies consisting of both males and females. The traditional approach to address this issue in genome-wide association is to include sex as a covariate in the statistical model when performing association analysis. Unfortunately, this approach ignores the potential for a difference in effect sizes between the two sexes and thus may lead to a loss in power. Furthermore, genome-wide gene-by-sex effects introduce additional background phenotypic variance, which can induce additional power loss. I present a novel meta-analytic approach for the analysis of genome-wide association studies consisting of both males and females. In this approach, males and females are analyzed separately and the results are combined using a random effects meta-analysis approach allowing for a difference in effect sizes between sexes. I show that by analyzing males and females separately, our method reduces the overall variance in each study, leading to an increase in statistical power when applying meta-analysis. I apply our method to the Northern Finland Birth Cohort data and show that our method has increased power over the traditional approach while controlling for false positives.

CHAPTER 2

Detecting the Presence and Absence of Causal Relationships Between Expression of Yeast Genes with Very Few Samples

2.1 Background

Inference of biological networks from high-throughput genomic data is a central problem in bioinformatics where many different types of methods have been proposed and applied to a wide diversity of datasets [MS07]. Several recent studies have collected data in model organisms such as yeast and mouse which contain both genetic variations and gene expressions from a set of genetically distinct individuals. Originally, these “genetical genomics” datasets were used to identify genetic variations located at specific genomic locations that affect expression levels in the form of linkages or associations [BYC02, BK05]. These studies treated expression levels as quantitative traits, and each associated genomic location is called an expression quantitative trait locus (eQTL). More recently, various statistical approaches have been applied to these datasets, demonstrating that they are particularly powerful for teasing out the underlying biological networks that govern how genetic variations mediate differential gene expression and how genes regulate and interact with each other [LPD06, RGM02, GDZ06, STM05]. Some of these methods build on pioneering work in using graphical models to model gene regulatory networks [FLN00, PRE01, SSR03, Fri04, HGJ01].

Extracting meaningful causal relationships from these networks has been a challenging but important area of genetical genomics. What differs in genetical genomics studies compared to traditional microarray analysis, and what makes causal inference possible, is the idea of modeling genetic variations as random perturbations to the underlying regulatory network. A principled way of representing the causal relationships in a biological network is using graphical causal models [Pea88, Pea00]. Such models represent causal relationships between random variables by means of a directed acyclic graph called a causal graph, where a directed edge between two variables represents direct causal influence. The data-generating process represented by a causal graph imposes a variety of constraints, such as conditional independence constraints, on the observed data. A rich theory of causal inference has been developed [Pea00, SGS00] which attempts to reconstruct aspects of the graph from the pattern of constraints in the observations. Causal relationships can then be read off directly from the reconstructed graph.

The advantage of the causal inference paradigm is that predictions made are in fact causal, and so can be directly verified with knockout, siRNA or allele swap experiments. Compared to other methods such as co-expression networks which aim to capture the global structure in the regulatory network, causal inference methods attempt to identify the actual biological mechanisms regulating gene expression. Furthermore, for many applications where the final goal is to perturb the biological system in some way, causal networks are advantageous because they naturally predict the effect of possible interventions. The resulting models can be perturbed in silico to help guide which experimental perturbations to apply.

The disadvantage of these methods is that existing causal inference theory is a large sample theory, and is only guaranteed to work asymptotically. Unfortunately, in the case of inferring biological networks from gene expression data, there are far fewer samples than genes, which means practical applications must be a successful synthesis of ideas from both causal inference and small sample statistics.

There are two main approaches to learning causal graphs in biological networks. Score-based methods assign each model a score that rewards high likelihood of the observed data and penalizes complexity, and then search for the highest scoring model [Suz93, LB94]. These methods have been used in identifying causal regulators in yeast [FLN00, PRE01, SSR03, ZZS08, BH05, KJ06] and causal mediators of disease in mice [SLY05]. Constraint-based methods rule out those causal graphs inconsistent with patterns of conditional independence constraints in the observations. These methods have been applied to discovering causal relationships between pairs of genes [CES07].

In this chapter, we discover the presence and absence of causal relationships between genes in yeast by examining their expression levels over a set of individuals with random genetic variations. Causal discovery is challenging in our case because there are several thousand genes, while the number of samples is very limited. In particular, most conventional conditional independence tests or model selection algorithms are not reliable in the small sample case, since conditioning severely reduces the number of samples available, and as a result we cannot infer independence with high confidence, limiting our ability to induce features of the causal graph.

Our approach is to rely on basic properties of graphical models to infer or exclude edge directionality based either on simple unconditional independence tests, which are possible to perform even in the small sample case, or on the results of simple model selection amongst small causal sub-graphs of the overall causal model which have particularly strong signals. Our philosophy is that due to the small number of samples, it is impossible to accurately recover the complete causal graph. We opt to predict only the subset of the network where our predictions are likely to be correct.

We take advantage of prior biological knowledge that genetic variations affect gene expressions, but not vice versa. This knowledge can be expressed graphically as forbidding directed paths from gene expressions to genetic variations. While in general it is not possible to recover most causal structures based on unconditional independence tests, the availability of prior knowledge allows us to “bootstrap” certain edge orientations, which in turn allows us to orient more paths as causal using basic properties of d-separation (described below). Moreover, we can also rule out certain edge orientations using the same principles, thus identifying the absence of certain causal relationships.

Our method is inherently conservative, only predicting the existence and orientation of edges in the causal graph if there is strong support from the sample data. As expected, our approach predicts only a small fraction of the complete causal regulatory network of yeast. However, the actual predictions made by the method are surprisingly consistent with previous experimentally validated knowledge of yeast gene regulation.

We demonstrate the utility of our method by analyzing the Brem et al. yeast strains [BYC02]. The 112 yeast strains in this dataset were created by crossing a laboratory strain with a wild strain of S. cerevisiae. Both genetic variations and gene expressions from each offspring have been collected. We focus our analysis on an interesting feature of this dataset known as “regulatory hotspots”, regions in the genome in which a genetic variation is correlated with the expression levels of many genes. Compared to the traditional eQTL mapping techniques that first identified these “regulatory hotspots”, our method provides much richer causal information that simple correlation cannot capture. First, our method allows us to infer causal relationships between pairs of genes to identify global regulators that control the gene expression of target genes correlated with a “regulatory hotspot”. Second, our method can exclude causal relationships between genes. Third, even when considering only variation-expression pairs, our method can distinguish whether a variation has a direct or an indirect effect on expression. While several other methods attempt to infer causal relationships between genes [ZZS08, CES07], our method is the first to be able to exclude causal relationships and distinguish between direct and indirect effects of a variation.

We evaluate our method’s ability to infer regulatory relationships by comparing it directly to two other competing methods as well as verifying our results with previous experimental validations. Of the 12 genes for which there is some experimental evidence that they behave as master regulators [YBW03, ZZS08], our method recovers 9. Furthermore, for one of our predictions, ILV6, a competing method by Zhu et al. [ZZS08] was not able to identify the gene as a causal regulator based on expression data alone; the gene was only identified when additional transcription factor binding data was incorporated. Combining our method’s ability to exclude specific causal relationships with gene set enrichment analysis, we find that gene targets not causally affected by a regulator are enriched for different pathways and biological processes than gene targets affected by the same regulator.

To evaluate our ability to distinguish between direct and indirect effects of a genetic variation on gene expression, we take advantage of the fact that most expression transcripts are affected directly by a few variations close to the gene through a mechanism called cis-regulation. Since our method does not rely on information about the relative positions of a genetic variation and its affected gene, an enrichment of cis-effects in our predictions for direct causal effects validates our method.

A shorter version of this article has previously been published as part of a conference proceeding [KSY09]. In this article, we provide more details on our causal inference procedure by providing the exact likelihoods for each gene in a triplet. We also updated our results by systematically identifying “regulatory hotspots” using a previously published method based on dividing the genome into discrete sections and approximating the appearance of a linkage as a Poisson process [BYC02]. Using this method, we identified 9 “regulatory hotspots” and 38 regulator genes which mediate the genetic variations. Finally, we provide two additional visualizations for the causal relationships we discover. We use a spring embedded algorithm to construct the yeast causal network and show that the “regulatory hotspots” overlap well with the inherent hub structures. We also use a representation grouped by the “regulatory hotspots” to show that there is significant cross talk between hotspots.

2.2 Methodology

2.2.1 Causal Graphs for Genetical Genomics

We first introduce the machinery of causal inference needed to formalize our approach to inferring causal relationships between a genetic variation (a SNP) and the expression of a pair of genes. Our primary object of study is the probabilistic causal model [Pea00].

Definition 1. A probabilistic causal model (PCM) is a tuple $M = \langle U, V, F, P(u) \rangle$, where

• $U$ is a set of background or exogenous variables, which cannot be observed or experimented on, but which can influence the rest of the model.

• $V$ is a set $\{V_1, \ldots, V_n\}$ of observable or endogenous variables. These variables are considered to be functionally dependent on some subset of $U \cup V$.

• $F$ is a set of functions $\{f_1, \ldots, f_n\}$ such that each $f_i$ is a mapping from a subset $S$ of $(U \cup V) \setminus \{V_i\}$ to $V_i$, and such that the set $F$ as a whole defines a mapping from $U$ to $V$.

• $P(u)$ is a joint probability distribution over the variables in $U$.

PCMs represent causal relationships between observable variables in $V$ by means of the functions in $F$: a given variable $V_i$ is causally determined by $f_i$ using the values of the variables in the domain of $f_i$. Causal relationships entailed by a given PCM have an intuitive visual representation using a graph called a causal diagram. In this graph, each variable is represented by a vertex, and a directed edge is drawn from a variable $X$ to a variable $V_i$ if $X$ appears in the domain of $f_i$. A graph obtained in this way from a model is said to be induced by said model.

A node Y is an ancestor of node Z in a causal diagram G if there is a directed path from Y to Z. Causal diagrams are generally assumed to be acyclic. While we expect the full yeast regulatory network to have causal cycles (they serve as common regulatory mechanisms), in this chapter we concentrate our efforts on the fragments of the overall network where acyclicity holds.
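The definitions above can be made concrete with a toy model. The sketch below encodes a hypothetical three-variable PCM (a SNP X, and expression levels Y and Z) as a dictionary of functions with explicit domains, derives the induced causal diagram over the observable variables, and computes ancestors. All names, coefficients, and distributions are illustrative assumptions and are not taken from the yeast data.

```python
import random

# Each observable variable maps to (domain, function). Names starting with
# "U_" are exogenous background variables; the others are observable.
MODEL = {
    "X": (("U_X",), lambda a: 1 if a["U_X"] < 0.5 else 0),
    "Y": (("X", "U_Y"), lambda a: 2.0 * a["X"] + a["U_Y"]),
    "Z": (("Y", "U_Z"), lambda a: 1.5 * a["Y"] + a["U_Z"]),
}


def sample(model, rng):
    """Draw the background variables from P(u), then evaluate each f_i."""
    values = {"U_X": rng.random(), "U_Y": rng.gauss(0, 1), "U_Z": rng.gauss(0, 1)}
    # The insertion order of MODEL is assumed to be a topological order.
    for name, (domain, f) in model.items():
        values[name] = f({d: values[d] for d in domain})
    return {name: values[name] for name in model}


def induced_edges(model):
    """Directed edges among observable variables: parent -> child."""
    return {
        (d, child)
        for child, (domain, _) in model.items()
        for d in domain
        if d in model
    }


def ancestors(node, edges):
    """All nodes with a directed path to the given node."""
    parents = {}
    for a, b in edges:
        parents.setdefault(b, set()).add(a)
    found, stack = set(), [node]
    while stack:
        for p in parents.get(stack.pop(), ()):
            if p not in found:
                found.add(p)
                stack.append(p)
    return found


edges = induced_edges(MODEL)
print(sorted(edges))
print(sorted(ancestors("Z", edges)))
print(sample(MODEL, random.Random(0)))
```

In this toy model X is an ancestor of Z only through Y, which is the kind of indirect effect the chapter aims to distinguish from a direct one.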

One advantage of causal graphs, and graphical models in general [Pea88, JW02] is their ability to represent conditional independence relations between variables in a qualitative and intuitive way using the notion of path blocking known as d-separation [Pea88]. Two variables X, Y are d-separated if all causal and confounding paths from X to Y contain at least one variable whose value is known, and the value of no common effect of both X and Y is known. Every d-separation statement involving two nodes (or sets of nodes) in the graph corresponds to a conditional independence among the corresponding sets of variables. That is, if every path from X to Y is blocked or d-separated by Z in a causal diagram G, then X and Y are conditionally independent given Z in every probability distribution compatible with G [Pea88]. Furthermore in stable [PV91] or faithful [SGS93] models the converse is also true: conditional independencies in the observations imply the corresponding d-separation statement holds in the underlying causal graph. The faithfulness assumption thus allows us to infer aspects of the generating causal graph from conditional independence constraints apparent in the data, and is crucial for inductive causal inference. Faithfulness holds in “most” causal models, and can thus be justified on Occam’s Razor grounds [Pea00].
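For a known graph, d-separation can be checked mechanically. The sketch below uses the standard equivalent criterion based on the moralized ancestral graph: X and Y are d-separated by a set S exactly when they are disconnected, after removing S, in the moral graph of the ancestors of X, Y and S. The function name and the example graphs are hypothetical, and X and Y are assumed not to belong to the conditioning set.

```python
from itertools import combinations


def d_separated(parents, x, y, given=()):
    """Return True if x and y are d-separated by `given` in a DAG.

    `parents` maps each node to the set of its parents.
    """
    given = set(given)
    # 1. Restrict to x, y, the conditioning set, and all of their ancestors.
    relevant, stack = set(), [x, y, *given]
    while stack:
        node = stack.pop()
        if node not in relevant:
            relevant.add(node)
            stack.extend(parents.get(node, ()))
    # 2. Moralize: link each child to its parents and "marry" co-parents.
    neighbours = {node: set() for node in relevant}
    for child in relevant:
        pa = list(parents.get(child, ()))
        for p in pa:
            neighbours[child].add(p)
            neighbours[p].add(child)
        for p, q in combinations(pa, 2):
            neighbours[p].add(q)
            neighbours[q].add(p)
    # 3. Search for an undirected path from x to y that avoids `given`.
    seen, stack = {x}, [x]
    while stack:
        for nb in neighbours[stack.pop()]:
            if nb == y:
                return False
            if nb not in seen and nb not in given:
                seen.add(nb)
                stack.append(nb)
    return True


chain = {"X": set(), "Y": {"X"}, "Z": {"Y"}}        # X -> Y -> Z
collider = {"X": set(), "W": set(), "C": {"X", "W"}}  # X -> C <- W
print(d_separated(chain, "X", "Z"), d_separated(chain, "X", "Z", given={"Y"}))
print(d_separated(collider, "X", "W"), d_separated(collider, "X", "W", given={"C"}))
```

The chain shows an intermediate variable blocking a causal path once it is known, and the collider shows that conditioning on a common effect opens a path that was otherwise blocked.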

Constraint-based inference of correct edge orientations in a causal diagram has two fundamental limits in practical applications. The first is that it can be difficult to collect sufficient samples to perform reliable conditional independence tests, and the second is that some causal diagrams may disagree on orientations of particular edges while entailing the same set of conditional independence constraints (such causal diagrams are called Markov-equivalent [VP90]).

In this chapter, we will use causal graphs to represent causal interactions between genetic variations and gene expression levels in yeast. In our case the genetic variations are single nucleotide polymorphisms (SNPs). In this chapter, we limit our focus to inferring the presence or absence of a causal relationship between gene expression levels based on independence tests and model selection we can actually perform. We will be relying on the following three (elementary) theorems in graphical models.

Theorem 1. Let G be a causal graph where X is d-connected to Y via a path ending in an arrow pointing to Y, X is d-connected to Z, and X and Z are d-separated by Y. Then Y is an ancestor of Z.

If we assume faithfulness, this theorem implies we can infer causal directionality based on the result of two unconditional independence tests and one conditional independence test. In our case, X is a SNP, Y is the expression level of a gene and Z is the expression level of a second gene. We are using our prior knowledge that expression levels do not affect SNP values to satisfy one of the preconditions of the theorem, namely that the d-connected path must end in an arrow pointing to Y. In particular, if Y is a gene expression value, and X is a SNP value correlated with Y, then Y cannot cause X. Using this theorem, we are able to infer a causal relationship between the expression levels of genes Y and Z.
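The three tests behind Theorem 1 can be illustrated on simulated data from a hypothetical chain, SNP X, then expression Y, then expression Z. The sketch uses Pearson correlation and first-order partial correlation as stand-ins for the independence tests, with a deliberately large simulated sample; the coefficients, sample size, and function names are assumptions made for the illustration and do not reflect the yeast dataset.

```python
import math
import random


def pearson(a, b):
    """Sample Pearson correlation of two equal-length sequences."""
    n = len(a)
    ma, mb = sum(a) / n, sum(b) / n
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / math.sqrt(va * vb)


def partial_corr(a, b, given):
    """Correlation of a and b after linearly adjusting both for `given`."""
    r_ab, r_ag, r_bg = pearson(a, b), pearson(a, given), pearson(b, given)
    return (r_ab - r_ag * r_bg) / math.sqrt((1 - r_ag ** 2) * (1 - r_bg ** 2))


def simulate_chain(n, seed=0):
    """Simulate X -> Y -> Z with a binary SNP and Gaussian noise."""
    rng = random.Random(seed)
    x = [rng.randint(0, 1) for _ in range(n)]
    y = [2.0 * xi + rng.gauss(0, 1) for xi in x]
    z = [1.5 * yi + rng.gauss(0, 1) for yi in y]
    return x, y, z


x, y, z = simulate_chain(5000)
print("corr(X, Y)     =", pearson(x, y))          # X d-connected to Y
print("corr(X, Z)     =", pearson(x, z))          # X d-connected to Z
print("corr(X, Z | Y) =", partial_corr(x, z, y))  # X, Z d-separated by Y
```

With thousands of samples the partial correlation is close to zero, matching the d-separation of X and Z by Y; with only on the order of a hundred strains such a conditional test would be far less reliable.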

Unfortunately, testing whether X is conditionally independent of Z given Y in the small sample case is not feasible. An alternative approach which is more appropriate in our case is to use a model selection method: rather than performing the independence test, we select the causal model over the local variables of interest and read off causal directionality from its graph. In general, if we restrict ourselves to a small part of a large causal model which contains three variables X, Y, Z, the causal diagram which captures conditional independencies in the corresponding marginal distribution, that is P(x, y, z), will be a mixed graph containing both directed and bidirected arcs, called a latent projection [VP90]. In latent projections, a directed arc from X to Y corresponds to a d-connected path which starts with an arrow pointing away from X and ends with an arrow pointing towards Y in the original, larger graph such that every node on the path other than X and Y is marginalized out or latent. Similarly, a bidirected arc from X to Y corresponds to a d-connected path in the larger graph which starts with an arrow pointing to X, ends with an arrow pointing to Y, and every node on the path other than X and Y is marginalized out or latent.

If we restrict ourselves to local models of three variable marginal distributions, where certain causal relationships are excluded due to prior knowledge (e.g. genes cannot cause SNPs), the complete set of causal hypotheses is captured by a small set of latent projections.

Theorem 2. Let G be a causal graph where X is an ancestor of Y and Z. Then the latent projection which represents conditional independencies of P (x, y, z) is one of the graphs in Figure 2.3(a).

Theorem 2 allows us to select the graph in Figure 2.3(a) which best fits the available data (we use a version of the likelihood ratio test), and use this graph to conclude causal directionality. The next theorem allows us to conclude the opposite, that a variable cannot be a causal ancestor of another.

Theorem 3. Let G be a causal graph where X is d-connected to Y , and X and Z are d-separated. Then Y cannot be an ancestor of Z.

As before, faithfulness allows us to apply this theorem to conclude the absence of causal directionality based on the results of two unconditional independence tests. In our case, SNP X is associated with the expression level of gene Y, and SNP X is independent of the expression level of gene Z. We can then rule out a causal effect of the expression level of gene Y on the expression level of gene Z. The possible models are shown in Figure 2.3(b). In the small sample case we again use a maximum likelihood method to perform such tests.

In the next section, we describe our statistical methodology in more detail.

2.2.2 Inference Algorithm Overview

Our algorithm for inferring the presence or absence of causal relationships between gene expression levels proceeds in four steps. First, we find, for every gene expression, the set of potential causal SNPs using the standard F-test. Second, we infer the presence of causal relationships between pairs of genes correlated with the same SNP by comparing the likelihoods of possible models. Third, we distinguish between direct and indirect effects of genetic variation on gene expression. Fourth, we infer the absence of causal relationships based on the results of step one and Theorem 3.

2.2.3 Finding potential causal SNPs

In the first step, we attempt to find, for every gene expression level, the set of potential causal SNPs, in other words the set of SNPs which are either causal or which are confounded with causal SNPs.

To examine the (potential) causal relationship between SNP $S_i$ and expression level $E_j$ in our small sample case, we assume the following linear relationship between the two: $E_j = \alpha S_i + \epsilon$. We use an arrow notation to signify potential causality ($\rightarrow$) and its negation ($\nrightarrow$) to signify no potential causality. Under the null hypothesis of no potential causal relationship between the SNP and expression levels ($S_i \nrightarrow E_j$), we expect $\alpha = 0$ ($H_0$). Under the alternate hypothesis of a potential causal relationship ($S_i \rightarrow E_j$), we expect $\alpha \neq 0$ ($H_1$). To decide between these hypotheses, one could calculate the likelihood ratio statistic $x_{ij} = -2 \log \frac{L(H_0)}{L(H_1)}$ or use the standard F-test, which is related to the likelihood ratio statistic by $F_{ij} = (N-2)\left(e^{x_{ij}/N} - 1\right)$ and follows asymptotically the F distribution with $1, k-2$ degrees of freedom, where $k$ is the number of samples. We calculate the F statistic $F_{ij}$ for every SNP/expression pair $(S_i, E_j)$. To assign significance, we shuffle the labels of the individuals $B$ times to obtain the null statistics $F^0_{ijb}$, $b = 1, 2, \ldots, B$. Then the p-value of each SNP and expression pair can be calculated by looking at the ranking of the statistic of the pair in the permuted null statistic distribution.
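The F-test and the permutation scheme above can be sketched in a few lines. This is a minimal illustration and not the implementation used for the results in this chapter: the function names are hypothetical, the regression is assumed to include an intercept (which matches the 1, k−2 degrees of freedom), and the p-value uses an add-one correction that the text does not specify.

```python
import math
import random


def f_statistic(snp, expr):
    """F statistic for the slope of the regression expr = mu + alpha * snp + noise.

    Assumes the SNP is polymorphic in the sample and the fit is not perfect,
    so that neither denominator below is zero.
    """
    n = len(snp)
    mean_s = sum(snp) / n
    mean_e = sum(expr) / n
    sxx = sum((s - mean_s) ** 2 for s in snp)
    sxy = sum((s - mean_s) * (e - mean_e) for s, e in zip(snp, expr))
    rss_null = sum((e - mean_e) ** 2 for e in expr)  # alpha fixed at 0
    rss_alt = rss_null - sxy ** 2 / sxx              # alpha fitted by least squares
    return (n - 2) * (rss_null - rss_alt) / rss_alt


def likelihood_ratio_statistic(snp, expr):
    """x = -2 log(L(H0)/L(H1)), recovered from F via F = (n - 2)(exp(x/n) - 1)."""
    n = len(snp)
    return n * math.log(1.0 + f_statistic(snp, expr) / (n - 2))


def permutation_p_value(snp, expr, n_perm=1000, seed=0):
    """Rank the observed F statistic among statistics from shuffled labels."""
    rng = random.Random(seed)
    observed = f_statistic(snp, expr)
    shuffled = list(expr)
    exceed = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if f_statistic(snp, shuffled) >= observed:
            exceed += 1
    # Add-one correction (an assumption here) keeps the estimate strictly positive.
    return (exceed + 1) / (n_perm + 1)
```

Shuffling the expression values against fixed genotypes is equivalent to shuffling the labels of the individuals. The smallest attainable p-value is 1/(B+1), so B has to be large when many SNP/expression pairs are tested.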

We can easily estimate the false discovery rate (FDR) for our statistic using previous approaches [ST03]. To limit the number of potential causal networks to evaluate in subsequent steps, we retain only the SNP/expression pairs with an FDR of q < 0.01.
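As a rough illustration of the filtering step, the sketch below converts p-values into FDR-adjusted values and keeps the pairs below 0.01. It deliberately swaps in the Benjamini-Hochberg step-up adjustment, which corresponds to the conservative choice of assuming that all null hypotheses are true; it is not the estimator of [ST03], and the function name is hypothetical.

```python
def bh_adjusted_p_values(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity of the result.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


def significant_pairs(pairs, p_values, q_cutoff=0.01):
    """Keep the (SNP, expression) pairs whose adjusted value is below the cutoff."""
    adjusted = bh_adjusted_p_values(p_values)
    return [pair for pair, q in zip(pairs, adjusted) if q < q_cutoff]
```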

Due to linkage disequilibrium or local correlation of variation, the SNPs which are correlated with expressions are not likely to be actually causal, but instead correlated with causal SNPs in the same genomic region. Since all of the SNPs are correlated in a region, this does not affect our ability to make inferences about the causal regulatory network, but we must keep in mind that the SNPs which we predict to have direct effects are likely proxies for the true causal variants.

2.2.4 Finding causal relationships between genes

The next stage of our algorithm consists of inferring causal directionality between gene expressions by using Theorem 1 and Theorem 2. Since the two unconditional independence tests have already been performed in the first step, all that remains is to test conditional independence. Unfortunately, conditional tests present a problem in the small sample case since conditioning further limits the number of samples we have to test. An alternative approach is to consider multiple models consistent with the results of the unconditional independence tests where in some models the conditional independence holds, and in others it does not. If a model where the conditional independence holds is the best fit for the data, and moreover accounts for most of the fit compared to a “default” model making no conditional independence assumptions, then we assume the conditional independence is likely true.

In our case, we are considering fragments of the causal graph consisting of a single SNP S and two expression levels Ei, Ej dependent on S (due to step 1). Figure 2.3(a) shows the nine possible causal models in the case that all of the elements are pairwise correlated. In H1 the SNP affects both expression levels independently. In H2 and H3 there is a direct causal relationship between the two expression levels. The “default” models H4 through H9 impose no constraints on the data and are indistinguishable based on conditional independence tests. Since they are all equivalent, for simplicity, we only consider H4 below.

We obtain information about the network whenever we predict a triplet to have a model H1, H2 or H3. To distinguish between the three hypotheses H1, H2 and H3, we compute likelihood ratio statistics for each hypothesis against the alternative H4, and conclude that a hypothesis is likely true if the corresponding ratio exceeds the other ratios (i.e. it fits better than the other simple hypotheses) and is close to unity (i.e. a simpler hypothesis accounts for the observations). The fact that the likelihood ratio is close to unity means that the missing edge in the triplet does not hurt the likelihood of the model compared to the “default” model (H4). This is equivalent to the standard approach of performing a likelihood ratio test for model selection taking into account a complexity penalty. In this case, the complexity penalty would be applied to H4 since the model has an additional degree of freedom. We also compare the likelihoods of H1, H2 and H3 pairwise against each other and only consider triplets where the most likely hypothesis is more likely than the others by a threshold.

We compute the likelihood for each model by computing the likelihoods at each target node. Since we are interested only in the causal effects on individual genes, we can represent the causal effects on an individual gene using a linear model assuming Gaussian noise. For every triplet, we can write the following linear model for genes g1 and g2 and the common associated SNP s.

$$g_1 = \mu_1 + \beta_{g_2} g_2 + \beta_{s_1} s + e_1 \qquad (2.1)$$

$$g_2 = \mu_2 + \beta_{g_1} g_1 + \beta_{s_2} s + e_2 \qquad (2.2)$$

where $\mu_1$ and $\mu_2$ are the means for $g_1$ and $g_2$ respectively, $\beta_{g_2}$, $\beta_{s_1}$, $\beta_{g_1}$, and $\beta_{s_2}$ are the causal coefficients of $g_2$, $s$, $g_1$, and $s$ on their respective causal target nodes, and $e_1$ and $e_2$ are noise terms which follow a Gaussian distribution. In this model, all coefficients are estimated by maximum likelihood estimation. The regression coefficients are interpreted, following Wright’s rule [Wri21], as sums of path products of coefficients in the underlying (and unknown) true causal graph.

Since we assume that each gene expression is independently sampled from an underlying generative model, the likelihood of the model given the data is the product of the Gaussian densities of the errors calculated by the least squares method. We can represent this mathematically as follows:

$$L(M \mid D) = \prod_{i=1}^{k} \frac{1}{\hat{\sigma}\sqrt{2\pi}} \exp\left(-\frac{(X_i - \hat{\mu})^2}{2\hat{\sigma}^2}\right) \qquad (2.3)$$

where $X_i$ is the $i$-th data sample, $k$ is the number of samples, and $\hat{\sigma}$ and $\hat{\mu}$ are computed by maximum likelihood estimation from the given data.
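The likelihood computation and the comparison of H1, H2 and H3 against H4 can be sketched as follows. This is an illustration under stated assumptions rather than the code behind the reported results: the edge sets assigned to H1 through H4 follow the textual description (H2 is taken to be S → Ei → Ej and H3 to be S → Ej → Ei), the `margin` between the best and second-best simple model is a hypothetical value, and `penalty=2.0` mirrors the log likelihood difference of 2 used in the results.

```python
import math


def _dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def gaussian_loglik(y, predictors):
    """Maximised Gaussian log-likelihood of y regressed on an intercept plus predictors.

    Design columns are orthogonalised by Gram-Schmidt; collinear predictors
    are not handled in this sketch.
    """
    n = len(y)
    basis = []
    for col in [[1.0] * n] + [[float(v) for v in p] for p in predictors]:
        for b in basis:
            coef = _dot(col, b) / _dot(b, b)
            col = [c - coef * bi for c, bi in zip(col, b)]
        basis.append(col)
    resid = [float(v) for v in y]
    for b in basis:
        coef = _dot(resid, b) / _dot(b, b)
        resid = [r - coef * bi for r, bi in zip(resid, b)]
    sigma2 = _dot(resid, resid) / n  # maximum likelihood estimate of the noise variance
    return -0.5 * n * (math.log(2 * math.pi * sigma2) + 1.0)


def triplet_logliks(s, gi, gj):
    """Log-likelihood of each model as a sum over its target nodes."""
    return {
        "H1": gaussian_loglik(gi, [s]) + gaussian_loglik(gj, [s]),      # S -> Ei, S -> Ej
        "H2": gaussian_loglik(gi, [s]) + gaussian_loglik(gj, [gi]),     # S -> Ei -> Ej
        "H3": gaussian_loglik(gj, [s]) + gaussian_loglik(gi, [gj]),     # S -> Ej -> Ei
        "H4": gaussian_loglik(gi, [s]) + gaussian_loglik(gj, [s, gi]),  # no constraint
    }


def select_model(s, gi, gj, penalty=2.0, margin=1.0):
    """Return H1, H2 or H3 if it fits almost as well as H4 and beats its rivals."""
    ll = triplet_logliks(s, gi, gj)
    ranked = sorted(("H1", "H2", "H3"), key=ll.get, reverse=True)
    best, runner_up = ranked[0], ranked[1]
    if ll["H4"] - ll[best] <= penalty and ll[best] - ll[runner_up] >= margin:
        return best
    return "H4"  # ambiguous triplet: fall back to the default model
```

Because H1 and H2 are nested in H4 as written, the difference `ll["H4"] - ll[best]` is non-negative for them, and a small value indicates that dropping an edge costs little in fit.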

2.2.5 Distinguishing between direct and indirect effects of variation

If a SNP S is associated with two genes Ei and Ej, the nine possible models are H1, H2, H3 and the “default” models H4 through H9. The models H2 and H3 explain the associations as a direct effect of the SNP on one gene and an indirect effect on the other. Model H1 suggests that the SNP directly affects the expression levels of both genes. Since our statistical methodology uses H4 as the default model, we are unable to distinguish between direct and indirect effects if we cannot classify a triplet as one of H1, H2, or H3.

Establishing direct and indirect effects in causal analysis is always done with respect to a particular model granularity. This is because it is generally possible to observe intermediate variables between any direct cause and its effect; finer granularity removes directness of causation. In our case, when distinguishing direct versus indirect effects, the specific triplet that we are observing determines whether an effect is direct or indirect. Consider the following motivating example of a SNP S and three genes with the underlying network S → E1 → E2 → E3. If we consider the triplet (S, E2, E3), the correct structure of the subgraph is H2, and S will have a direct effect on E2 and an indirect effect on E3. Now if we consider the triplet (S, E1, E2), the correct structure is again H2, and S will have a direct effect on E1 and an indirect effect on E2. Intuitively, this is because when we consider the triplet (S, E2, E3), E1 is unobserved. Thus each prediction of a triplet as H2 or H3 induces a partial order on the causal relationships between gene pairs. After examining all pairs, we return the minimum set of causal relationships which are consistent with all of the triplet predictions.
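One possible reading of "the minimum set of causal relationships consistent with all of the triplet predictions" is the transitive reduction of the predicted gene-to-gene edges. The sketch below follows that reading, which is an assumption and not a confirmed description of the implementation; it also assumes the predicted edges form an acyclic graph. In the motivating example, a hypothetical triplet (S, E1, E3) predicted as H2 would add the edge E1 → E3, which is then dropped because it is implied by E1 → E2 → E3.

```python
def transitive_reduction(edges):
    """Drop every edge (a, b) for which another directed path from a to b exists.

    Assumes the edge set is acyclic, in which case the result is unique.
    """
    edges = set(edges)
    children = {}
    for a, b in edges:
        children.setdefault(a, set()).add(b)

    def reachable_without(start, target, skipped_edge):
        # Depth-first search from start that never uses skipped_edge.
        stack, seen = [start], set()
        while stack:
            node = stack.pop()
            for nxt in children.get(node, ()):
                if (node, nxt) == skipped_edge or nxt in seen:
                    continue
                if nxt == target:
                    return True
                seen.add(nxt)
                stack.append(nxt)
        return False

    return {(a, b) for a, b in edges if not reachable_without(a, b, (a, b))}
```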

More complicated networks introduce ambiguity into our ability to distinguish between direct and indirect effects. For example, if we add the edge S → E3 in our example, we can still identify E1 → E2 as a direct effect from the triplet (S, E1, E2), but are unable to identify E2 → E3 as a direct effect. This is because we will predict the structure of each triplet containing E3 and either E1 or E2 as H4, where the effects are ambiguous. However, if there is an additional edge in the graph E3 → E4, the triplet (S, E3, E4) would identify E3 → E4 as a direct effect.

2.2.6 Excluding causal relationships between genes

The ability to exclude certain causal relationships between genes, an inherent advantage of causal analysis, is important to obtaining a more complete understanding of genetic regulation. For example, a gene might be causal to a number of genes enriched for a biological process but not causal to a number of genes enriched for a different biological process, even though it is correlated with both sets of genes. We attempt to determine the absence of causal relationships by looking at a SNP and a pair of genes where the SNP is the potential cause of one gene, but not the other. In this case, basic properties of d-separation (Theorem 3) guarantee that there are only four possibilities, H10 through H13 (see Figure 2.3(b)).

In H10, the SNP affects gene expression Ei, but gene expression Ej is completely independent from both the SNP and gene expression Ei. In H11, both the SNP and gene expression Ej affect gene expression Ei simultaneously. In H12, the SNP affects gene expression Ei, and gene expressions Ei and Ej have a hidden common causal parent. In H13, both the SNP and gene expression Ej affect gene expression Ei and, at the same time, gene expressions Ei and Ej have a hidden common causal parent. In none of these models does gene expression Ei affect gene expression Ej.

We model the association between a SNP and a gene expression using a linear Gaussian model as in Section 2.2.3. We correct for multiple hypothesis testing by computing the false discovery rate (FDR) [ST03]. We identify pairs of genes Ei and Ej where we can exclude causal relationships using the following criterion: a SNP is significantly associated with Ei (FDR of q < 0.01) and not associated with Ej (FDR of q > 0.9).
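The exclusion criterion is simple enough to state as a short sketch. The data layout (a dictionary keyed by hypothetical `(snp, gene)` pairs holding FDR q-values) and the function name are assumptions made for illustration; the two thresholds are the ones given above.

```python
def excluded_edges(q_values, assoc_q=0.01, indep_q=0.9):
    """Return ordered gene pairs (Ei, Ej) for which 'Ei causes Ej' is ruled out.

    q_values maps (snp, gene) to the FDR q-value of their association.
    """
    by_snp = {}
    for (snp, gene), q in q_values.items():
        by_snp.setdefault(snp, {})[gene] = q

    excluded = set()
    for genes in by_snp.values():
        linked = [g for g, q in genes.items() if q < assoc_q]      # SNP associated with Ei
        unlinked = [g for g, q in genes.items() if q > indep_q]    # SNP not associated with Ej
        for gi in linked:
            for gj in unlinked:
                excluded.add((gi, gj))
    return excluded
```

Genes with intermediate q-values (between 0.01 and 0.9) contribute to neither list, so no conclusion is drawn about them from that SNP.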

2.3 Results

We applied our method to an expression dataset of 5534 genes and a genotyping dataset of 2956 SNPs collected over 112 genetic segregants of yeast. After step 1, we found 42331 (SNP, expression) pairs where the SNP is causal to the expression at an FDR of q < 0.01. We constructed triplets from these causal pairs to significantly reduce the number of possible causal models to evaluate for causal relationships between the genes in step 2. For each triplet, we considered the four possible models H1, H2, H3 and H4 and identified the most likely as described above. We found the most likely of H1, H2 and H3 and required that the log likelihood of the best model be within 2 of that of H4. This is equivalent to penalizing the likelihood of H4 and using the likelihood ratio for model selection. Inferring causal relationships with few samples can result in directional and causal conflicts. A directional conflict occurs when triplets containing the same pair of genes but different SNPs predict different causal directions between the genes. A causal conflict occurs when different triplets both predict and exclude the same causal edge. We examined the robustness of our method by quantifying the number of directional and causal conflicts. As Table 2.1 shows, consistently across complexity penalties, fewer than 3% of predicted causal relationships are in conflict. These prediction conflicts are due to the limited number of samples available. We exclude all conflicting predictions from our final result.
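The two kinds of conflict can be detected with simple set operations. The sketch below is illustrative only; it assumes, hypothetically, that each prediction is stored as a `(snp, cause_gene, effect_gene)` tuple and that exclusions use the same layout.

```python
def find_conflicts(predicted, excluded):
    """Return (directional conflicts, causal conflicts) among triplet predictions.

    predicted and excluded are iterables of (snp, cause_gene, effect_gene).
    """
    predicted_edges = {(a, b) for _, a, b in predicted}
    excluded_edges = {(a, b) for _, a, b in excluded}
    # Directional conflict: both a -> b and b -> a were predicted.
    directional = {frozenset((a, b)) for a, b in predicted_edges
                   if (b, a) in predicted_edges}
    # Causal conflict: the same edge was both predicted and excluded.
    causal = predicted_edges & excluded_edges
    return directional, causal


def consistent_edges(predicted, excluded):
    """Predicted edges that take part in no conflict of either kind."""
    directional, causal = find_conflicts(predicted, excluded)
    return {(a, b) for _, a, b in predicted
            if frozenset((a, b)) not in directional and (a, b) not in causal}
```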

The genetic variations inherent in the individuals we study can be seen as naturally occurring random perturbations to the underlying regulatory networks that ultimately give rise to subtle differences in gene expression. We present our results in the context of these regulatory networks by identifying regulators, genes that are directly affected by the SNPs, and targets, genes that are controlled by the regulators. Formally, we call a gene a regulator if there exists a directed edge from a cis SNP to the gene, and a gene a target if there exists a directed edge from a regulator to the gene. Intuitively, the requirement for a causal cis SNP ensures a high probability that the SNP directly perturbs the gene expression of the regulator. In our data, we found 3370 causal relationships consisting of 212 causal regulator genes and 1396 affected target genes. Table 2.1 shows the number of causal relationships, causal regulators and affected target genes discovered using various model complexity penalties for H4.

One way to make sense of the large number of causal relationships detected is to look for causal regulators that affect a number of genes, or “causal hubs”. Of particular interest is identifying causal regulators that are associated with “regulatory hotspots”, defined as regions of the yeast genome linked to the expression of a large number of genes. Presumably, these “causal hubs” are important regulatory elements that lead to subtle changes in expression of genes belonging to a number of different biological processes and functions. Previous analyses have identified several “regulatory hotspots” in the yeast genome, but very little is known about the corresponding “causal hubs” because of the limited resolution of genotyping studies. In a few isolated cases, several groups have performed experimental knockout studies to confirm the existence of causal regulators and allele swap studies to further show that these regulators are perturbed by the corresponding “regulatory hotspot” [YBW03, ZZS08].

We first identified 9 “regulatory hotspots”, following previous methods [BYC02], by dividing the genome into 611 bins and approximating the number of linkages expected in each bin as a Poisson process. Figures 2.1 and 2.2 show the complete causal network inferred by our method, with regulators and targets colored by the “regulatory hotspots” they belong to. Gray nodes indicate that a gene does not belong to any identified “regulatory hotspot”. Figure 2.1 shows the spring-embedded network, where the positions of the nodes are determined so that the Euclidean distance is approximately proportional to the geodesic distance between two nodes [KK89]. Several regulatory hotspots overlap remarkably well with the inherent hub structures that are present in this representation, including hotspot 2 (bright red), hotspot 3 (bright green) and hotspot 9 (light blue). Figure 2.2 shows the same causal network but with the nodes grouped in a circle by the “regulatory hotspot” they belong to. This representation shows that there is significant cross talk between the regulatory hotspots and that a significant number of genes, indicated by the gray nodes, are not part of any regulatory hotspot in our causal network.
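The binning idea behind hotspot detection can be illustrated as follows. This is a sketch under assumptions and not the procedure of [BYC02] in detail: the Poisson rate is taken to be the mean number of linkages per bin, and the significance cutoff `alpha` is a hypothetical value that the text does not give.

```python
import math


def poisson_upper_tail(k, lam):
    """P(X >= k) for X ~ Poisson(lam)."""
    term = math.exp(-lam)  # P(X = 0)
    cdf = 0.0
    for i in range(k):
        cdf += term
        term *= lam / (i + 1)  # advance from P(X = i) to P(X = i + 1)
    return max(0.0, 1.0 - cdf)


def hotspot_bins(linkage_counts, alpha=0.001):
    """Indices of bins whose linkage count is improbably high under the Poisson model."""
    lam = sum(linkage_counts) / len(linkage_counts)
    return [i for i, count in enumerate(linkage_counts)
            if poisson_upper_tail(count, lam) < alpha]
```

With 611 bins, `linkage_counts` would hold one entry per bin, each giving the number of expression traits linked to markers in that bin.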

We further summarize our results by examining each regulatory hotspot in detail. The 38 genes which are involved in the 9 regulatory hotspots among the top 45 genes which have the largest number of targets are summarized in Table 2.2. (The 7 genes which do not belong to the 9 hotspots are SDS24(60), LYS2(17), URA3∗(17), GAS2(19), NMA111(20), NAM9+(14) and YML133C(30).) Both Chen et al. [CES07] and Zhu et al. [ZZS08] applied causal inference methods to the same data, allowing us to perform a direct comparison of the results. Among the genes suspected to be global regulators in the hotspots, there are a total of 12 causal regulators with some experimental evidence. Nine were proposed by the original group that collected the data: AMN1, MAK5, LEU2, MATALPHA1, URA3, GPA1, HAP1, SIR3 and CAT5 [YBW03]. Three additional regulators were validated in Zhu et al.: ILV6, SAL1 and PHM7 [ZZS08]. Our method discovers all but 3 of these (MAK5, SIR3 and CAT5). We note that SIR3 and CAT5 have much weaker experimental evidence than the others, and neither of the comparison methods (Chen et al. [CES07] and Zhu et al. [ZZS08]) was able to find these three. The best validation of our method is that we were able to find ILV6, which was experimentally validated in Zhu et al. [ZZS08]. However, Zhu et al. [ZZS08] used additional types of data (incorporating TFBS data from ChIP-chip experiments, phylogenetic conservation, and protein-protein interaction (PPI) data) in order to discover ILV6, and they claim that they would not have been able to discover ILV6 if they had used only the data that we used. We note that ILV6 was also suggested as a regulator for this hotspot by Kulp et al. [KJ06]. We recover the highlighted genes from Chen et al. [CES07], including NAM9, which was not found by Zhu et al. [ZZS08] and is supported by “bioinformatics type evidence” (GO analysis, etc.). A direct comparison to Chen et al. [CES07] is difficult because their results are organized in a different way, yet our results are consistent with Chen et al. [CES07] in that they highlight their discovery of 6 of the experimentally validated regulators, which we also discover.

Table 2.2 summarizes our results. Experimentally validated predictions are shown in bold. Regulators with an asterisk (*) were found by Zhu et al. [ZZS08]. Regulators marked with a plus (+) were found in the Chen et al. [CES07] study, and unlabeled regulators are novel predictions. In parentheses after the name of the regulator is the number of targets that we found. We note that in most cases the experimentally validated regulator is at the top of the list. We also observed that the ranking of predicted genes is maintained across model complexity cutoffs, provided that the cutoff is below a certain threshold. Of particular interest is a group of regulators linked to chromosome 14, which is enriched for mitochondrial genes. Previous published studies in yeast did not identify any putative regulators in this region [YBW03]. We found a number of genes in this region, including the previously identified genes SAL1 and TOP2 and several proteins of unknown function, including NMA111 and YNL035C.

We validate our ability to distinguish between direct and indirect effects of variation by considering the genomic positions of SNPs and the locations of genes that they are associated with. Variation that affects expression can be classified into two broad categories: cis-regulation, which is an effect of a variation near a gene that affects expression of the gene, and trans-regulation, which is an effect of variation located in one region of the genome affecting expression of genes in other regions. It is suspected that most cis-regulation is direct while trans-regulation may be either direct or indirect. Of the 42,331 SNP-gene pairs where the SNP is associated with the expression of the gene, 11,328 are predicted to be cis-regulated while 31,003 are trans-regulated. Using our approach, out of the 11,328 cis-regulated pairs, we predict 9,385 to have a direct effect on expression. Out of the 31,003 trans-regulated pairs, 20,509 have indirect effects. Thus cis-regulated genes are enriched in our predicted set of directly affected genes, while trans-regulated genes are enriched in indirectly affected genes.

We speculate that the identified causal regulators are likely to either directly control or perturb biological processes. However, the final step of our analysis also identifies a collection of genes that are causally irrelevant to other genes. Combining results from these two steps can help us identify specific biological processes that are either regulated or not regulated by these causal regulators. We examined the eight significant causal regulators from our results that have previous experimental validation. For each regulator, we construct two sets of genes, those that are causal targets and those that are causally irrelevant. We then use the hypergeometric distribution to assess the statistical significance of the overlap of each gene set with known gene sets. Table 2.3 shows the different GO pathways that are enriched when we performed this analysis. The eight regulators appear to be involved in very different biological processes. For example, AMN1 is a causal regulator for ribosome biogenesis and assembly while four other regulators, LEU2, MATALPHA1, URA3 and ILV6, are causally irrelevant for the process. Similarly, SAL1 is a causal regulator for the process of translation while HAP1 and PHM7 are causally irrelevant for the process. We notice that all significant processes are crucial for cell growth and survival but are controlled by different global regulators. The causal analysis shows that most of these global regulators participate specifically in certain biological processes. The one example of multiple regulators from the same regulatory hotspot is LEU2 and ILV6. In this case, the two regulators participate in the similar biological processes of organic acid metabolic process and amine biosynthetic process, respectively. We further confirmed the specificity of these global regulators by enrichment analysis for localization of the causal and causally irrelevant targets. For example, SAL1’s causal targets are enriched for localization to the ribosome while HAP1’s targets are enriched for localization to the mitochondrial membrane. Furthermore, both PHM7’s and HAP1’s causally irrelevant targets localize to the cytosolic region of the cell, where translation takes place. Similarly, although LEU2’s and ILV6’s causal targets are not enriched for a specific cellular compartment, their causally irrelevant targets are both enriched for the nucleolus, where ribosome biogenesis and assembly take place.
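The hypergeometric enrichment test mentioned above can be written down directly. The sketch below is a generic illustration; the function and parameter names are hypothetical, and it does not reproduce the gene universe or annotation sets used for Table 2.3.

```python
from math import comb


def hypergeom_upper_tail(k, population, annotated, drawn):
    """P(overlap >= k) when `drawn` genes are sampled without replacement.

    population: number of genes in the universe
    annotated:  number of genes in the known gene set (for example, a GO term)
    drawn:      number of genes in the tested set (causal targets or causally
                irrelevant genes of one regulator)
    """
    total = comb(population, drawn)
    upper = min(annotated, drawn)
    favourable = sum(
        comb(annotated, i) * comb(population - annotated, drawn - i)
        for i in range(k, upper + 1)
    )
    return favourable / total
```

As a usage example, with a universe of 10 genes of which 4 carry an annotation, drawing 3 genes and observing at least 2 annotated ones has probability 40/120 = 1/3.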

2.4 Discussion

In this chapter we combined a principled representation of causality using graphical causal models with small sample statistical methods to infer the presence and absence of causal relationships between yeast genes. Working with a dataset of genetically identical yeast strains allowed us to make strong causal assumptions about edge directionality in the underlying causal model. These assumptions, in turn, allowed us to take maximum advantage of the limited samples we had available by employing unconditional independence tests and simple model selection to discover or exclude causal directionality between gene expressions. This work motivates theoretical questions about the limits of causal inference based on either restricting or eliminating conditional independence tests and relying strictly on unconditional tests. In addition, our method does not explicitly account for hidden confounding effects and could potentially make erroneous predictions. Detecting causal relationships with latent variables is a challenging and active area of both theoretical and applied research. Promising new techniques have been suggested and can potentially be incorporated into our method.

We demonstrated the usefulness of our method by examining yeast expressions collected over a segregating population derived from two parental strains to identify many experimentally validated causal regulators. In addition, our approach is able to distinguish between direct and indirect effects of variation and to exclude causal relationships between genes. These results provide a rich description of the yeast gene regulation network beyond any previous results from mapping studies, coexpression analysis and competing causal methods.

Several interesting extensions can be applied to our method. One can either empirically or theoretically characterize the strength of effects recoverable by our method to hypothesize about the strength of regulation between genes. Many biological networks are in fact cyclical in nature, and the assumption of certain types of noise structures has been shown to be useful in identifying cycles in causal graphs. Finally, incorporating additional phenotype information can potentially help us understand the genetic basis of complex phenotypes.

2.5 Figures

Figure 2.1: Complete causal network in yeast with the nine regulatory hotspots colored. Circles designate regulators, squares designate targets and diamonds designate genes that are both regulators and targets. The spring-embedded view of the causal network shows that some hotspots, namely hotspots 2 (red), 3 (bright green), and 9 (light blue), overlap well with the hub-like structures of the network where regulators are positioned in the middle and targets surround the causal hub.

Figure 2.2: Complete causal network in yeast with the nine regulatory hotspots colored. Circles designate regulators, squares designate targets and diamonds designate genes that are both regulators and targets. The causal network grouped by hotspot shows that some regulators and targets (indicated by gray) are not part of known regulatory hotspots.

[Figure 2.3: panel (a) shows models H1 through H9 and panel (b) shows models H10 through H13, each drawn over a SNP S and gene expression levels Ei and Ej.]

Figure 2.3: Possible causal graphs for a triplet consisting of a SNP S and the expression levels of genes Ei and Ej. Bidirected edges denote hidden common causes. (a) Nine possible causal models consistent with S being a causal ancestor of Ei and Ej (models H4 through H9 are indistinguishable from observations of the triplet). (b) Four possible causal models consistent with S being a causal ancestor of Ei while being uncorrelated with Ej.

2.6 Tables

Complexity Penalty | # Causal Regulators | # Affected Genes | # Causal Relationships | # Direction Conflicts | # Causal Conflicts
1 | 146 | 1106 | 2135 | 34 (1.6%) | 7 (.3%)
1.5 | 183 | 1272 | 2794 | 30 (1.1%) | 11 (.4%)
2 | 212 | 1396 | 3370 | 44 (1.3%) | 13 (.4%)
2.5 | 240 | 1524 | 3983 | 59 (1.5%) | 18 (.5%)
3 | 266 | 1615 | 4571 | 81 (1.8%) | 20 (.4%)

Table 2.1: Summary Statistics For Different Likelihood Thresholds

Hot Spot | SNP Chr | SNP Loc | Regulators | Num Targets
1 | 2 | 370000 | TAT1(49) | 49
2 | 2 | 530000 | AMN1∗+(226), YSW1+(190), TBS1∗(177), CNS1∗+(166), ARA1∗(162), SUP45∗(31), AGP2(17), TOS1∗(16), YBR137W(16) | 303
3 | 3 | 90000 | NFS1∗(106), CIT2∗(100), LEU2∗+(77), HIS4(66), ILV6∗+(29) | 169
4 | 3 | 200000 | MATALPHA1∗(40), MATALPHA2(24) | 41
5 | 8 | 110000 | GPA1∗+(15) | 15
6 | 12 | 640000 | HAP1∗(22), MAP1∗(22) | 40
7 | 12 | 1060000 | YLR464W∗(32), YRF1-4∗(30), YRF1-5∗(22) | 33
8 | 14 | 503000 | SAL1∗+(138), LAT1(77), COG6(69), TOP2∗(62), MSK1(38), YNL035C(38), SWS2(17) | 320
9 | 15 | 150000 | PHM7∗+(227), RFC4(96), NDJ1∗(69), HAL9∗(66), ZEO1(55), WRS1(38), SKM1(28), YOL092W(18) | 263

Table 2.2: Regulatory Hotspots and Corresponding Regulators

Gene | Regulated Targets: GO Pathway | p value | Unregulated Targets: GO Pathway | p value
AMN1 | Ribosome biogenesis and assembly | 1.7 × 10^−34 | Establishment of localization | 2.3 × 10^−7
LEU2 | Organic acid metabolic process | 3.0 × 10^−7 | Ribosome biogenesis and assembly | 3.5 × 10^−10
MATα1 | Biological regulation | 5.9 × 10^−6 | Ribosome biogenesis and assembly | 3.6 × 10^−11
URA3 | De novo pyrimidine base biosynthetic process | 4.6 × 10^−6 | Ribosome biogenesis and assembly | 5.0 × 10^−6
HAP1 | Mitochondrial electron transport chain | 2.4 × 10^−10 | Translation | 3.6 × 10^−13
ILV6 | Amine biosynthetic process | 2.4 × 10^−17 | Ribosome biogenesis and assembly | 2.9 × 10^−16
SAL1 | Translation | 7.9 × 10^−30 | Chromosome organization and biogenesis | 1.3 × 10^−6
PHM7 | Carbohydrate metabolic process | 3.7 × 10^−9 | Translation | 2.0 × 10^−8

Table 2.3: Significantly Enriched Processes For Causal and Not Causal Genes

Reference to published article

Eun Yong Kang, Chun Ye, Ilya Shpitser and Eleazar Eskin, “Detecting the presence and absence of causal relationships between expression of yeast genes with very few samples”, in Journal of Computational Biology, March, 2010.

CHAPTER 3

Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features

3.1 Introduction

3.1.1 Background

In many scientific problems, identifying causal relationships is an important part of the solution. The gold standard for identifying causal relationships is a randomized experiment. However, in many real world situations, a randomized experiment cannot be performed for ethical, practical or financial reasons. Therefore, identifying causal relationships from observational data is an unavoidable and important step in understanding and solving many scientific problems.

A popular tool to represent causal relationships is a graphical model called a causal diagram [Pea88, Pea00]. A causal diagram is a directed acyclic graph (DAG), which consists of nodes and directed edges. Nodes represent variables of interest, while directed edges between nodes represent directed causal influence between the two end nodes of those edges. A causal diagram is a data generating model, with the bundle of directed arrows pointing to each node representing the causal mechanism which determines values of that node in terms of values of that node’s direct causes.

The ideal goal of inferring causal relationships from observational data is to identify the exact data generating model. However, inferring causal relationships from observational data alone has a fundamental limitation: the best we can infer is the Markov equivalence class [VP90] of the data generating model.

The Markov equivalence class is the set of graphical models which all represent the same set of conditional independence assertions among observable variables. All graphical models in a Markov equivalence class share the same set of adjacencies and v-structures. Here a v-structure is a node triple in which two non-adjacent nodes are parents of the third node. If we find the correct Markov equivalence class of the data generating model, then that class contains the true data generating model as a member. The true generating model and the other models in its Markov equivalence class must agree on the directions of certain edges, while possibly disagreeing on others. Since all models in the Markov equivalence class share v-structures, all edges taking part in v-structures must have the same direction in all models in the class. Furthermore, certain other edges must have the same direction in the entire class as well. These edges have the property that reversing their direction would entail the creation of a v-structure which is not present in the class.
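The characterization above (same adjacencies, same v-structures) can be checked mechanically. The following is a minimal sketch, assuming a DAG is given as a set of (parent, child) pairs over a shared node set; the function names and the three example graphs are hypothetical and do not come from the original text.

```python
from itertools import combinations

def skeleton(edges):
    """Undirected adjacencies of a DAG given as a set of (parent, child) pairs."""
    return {frozenset(e) for e in edges}

def v_structures(edges):
    """All triples (a, child, b) in which a and b are non-adjacent parents of child."""
    adj = skeleton(edges)
    parents = {}
    for p, c in edges:
        parents.setdefault(c, set()).add(p)
    found = set()
    for child, pa in parents.items():
        for a, b in combinations(sorted(pa), 2):
            if frozenset((a, b)) not in adj:
                found.add((a, child, b))
    return found

def markov_equivalent(g1, g2):
    """Two DAGs are Markov equivalent iff skeletons and v-structures coincide."""
    return skeleton(g1) == skeleton(g2) and v_structures(g1) == v_structures(g2)

# Hypothetical three-node examples.
chain = {("a", "b"), ("b", "c")}      # a -> b -> c
fork = {("b", "a"), ("b", "c")}       # a <- b -> c
collider = {("a", "b"), ("c", "b")}   # a -> b <- c
```

In this sketch the chain and the fork fall in the same class, while the collider, which introduces a v-structure, does not.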

3.1.2 Previous Approaches and Their Limitations

Computing marginal posterior probabilities of graphical features by considering all possible causal network structures is computationally infeasible unless the number of variables is small, because the number of possible network structures grows super-exponentially. For this reason, approximation methods for computing marginal posterior probabilities of graphical features such as directed or undirected edges have been proposed. Madigan and York [MY95] applied a Markov chain Monte Carlo (MCMC) method in the space of all network structures. Friedman and Koller [FK03] developed a more efficient MCMC procedure which operates in the space of all orderings of variables. The problem with these approximation methods is that inference accuracy is not guaranteed for finite runs of MCMC.

Several methods have been developed to compute the exact posterior probability of graphical features using dynamic programming (DP) algorithms. Koivisto and Sood [KS04] proposed a DP algorithm to compute the marginal posterior probability of a single undirected edge in $O(n2^n)$ time, where $n$ is the number of variables. Koivisto [Koi06] also developed a DP algorithm for computing all $n(n-1)$ undirected edges in $O(n2^n)$ time and space. In both DP algorithms, the number of adjacent nodes is bounded by a constant. These methods compute the posterior probability by marginalizing over all possible orderings of the variables in the graphical model. Both the DP approaches and the order MCMC approaches require a special prior for the graphical structure when summing over all possible orders of the variables. The drawback of this structure prior is that, because DAGs are averaged over the space of variable orderings instead of over all possible structures, it results in biased posterior probabilities of features and leads to incorrect inferences [EW08]. To fix this bias problem, new MCMC algorithms were proposed recently [EM07], [EW08]. However, these sampling-based algorithms still cannot compute the exact posterior probability.

Tian and He [TH09] proposed a new dynamic programming algorithm which computes marginalized posterior probabilities of directed edges over all possible structures in $O(n3^n)$ total time when the number of parents per node is bounded by a constant. This algorithm requires a longer running time than Koivisto’s approach [Koi06], but it computes exact, unbiased posterior probabilities since it marginalizes over all possible structures instead of all possible orders of variables.

However, predicting only directed edges is problematic for the following two reasons. First, if we predict a directed edge for two fixed nodes a and b, there are only three possible predictions: a → b, a ← b, or no edge between a and b. If the presence of an edge is predicted, then there are two possibilities. If the edge is constrained to always point in the same direction in every model in the Markov equivalence class, then, in the limit, its posterior should approach 1. On the other hand, if an edge can point in different directions in different models in the Markov equivalence class, then, in the limit, the posterior probability of a particular orientation should approach the fraction of the models in the class which agree on this orientation. For example, in the class shown in Figure 3.1, in the limit, P(c → d) = 1/3 while P(c ← d) = 2/3. Since the scales of the posterior probabilities of these two kinds of edge features differ, it is difficult to choose a single appropriate posterior threshold. One might argue that this scale difference can be addressed by direct scale adjustment. However, direct scale adjustment is not possible, since we do not know in advance which edges are compelled to point in the same direction across the equivalence class and which are not.

Figure 3.1: Three graphical models from a Markov equivalence class

Second, we usually have only limited samples available in real-world situations, which can lead to inaccurate posterior probability computation. This small-sample error can make the scale difference problem worse when only directed edges are considered for posterior computation, leading to an inappropriate decision threshold.
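The limiting value of an orientation posterior, the fraction of class members that agree on the orientation, can be illustrated by enumerating a small equivalence class. This is a sketch under stated assumptions: the three-node chain a - b - c is a hypothetical stand-in (it is not claimed to be the graph of Figure 3.1), and the helper names are invented for illustration.

```python
from fractions import Fraction
from itertools import combinations, product

def is_acyclic(nodes, edges):
    """Repeatedly remove nodes with no incoming edge; failure means a cycle."""
    remaining, edges = set(nodes), set(edges)
    while remaining:
        roots = {n for n in remaining if not any(c == n for _, c in edges)}
        if not roots:
            return False
        remaining -= roots
        edges = {(p, c) for p, c in edges if p not in roots}
    return True

def v_structures(edges):
    adj = {frozenset(e) for e in edges}
    out = set()
    for child in {c for _, c in edges}:
        pa = sorted(p for p, c in edges if c == child)
        out |= {(a, child, b) for a, b in combinations(pa, 2)
                if frozenset((a, b)) not in adj}
    return out

def equivalence_class(nodes, dag):
    """All DAGs with the same skeleton and the same v-structures as `dag`."""
    dag = sorted(dag)
    target = v_structures(dag)
    members = []
    for flips in product([False, True], repeat=len(dag)):
        cand = {(c, p) if flip else (p, c) for (p, c), flip in zip(dag, flips)}
        if is_acyclic(nodes, cand) and v_structures(cand) == target:
            members.append(cand)
    return members

def orientation_fraction(members, edge):
    """Fraction of class members that contain the directed edge `edge`."""
    return Fraction(sum(edge in m for m in members), len(members))

nodes = ["a", "b", "c"]
members = equivalence_class(nodes, {("a", "b"), ("b", "c")})
```

For this chain the class has three members, and the two orientations of the a - b edge are shared by one and two of them respectively, which reproduces the 1/3 versus 2/3 pattern discussed above.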

3.1.3 Proposed Approach

The scale difference of the posterior probabilities of directed edge features between different kinds of edges makes it difficult to determine a single appropriate threshold. Therefore to correctly recover the model, inferring only directed edge features is not sufficient.

The following well-known theorem [VP90] will guide us in the choice of features.

Theorem 4. Any two Markov-equivalent graphs have the same set of adjacencies and the same set of v-structures, where a v-structure is a node triple ⟨A, B, C⟩ in which A and C are non-adjacent parents of B.

Theorem 4 gives us a simple set of features to test in order to completely determine the Markov equivalence class of the data generating model: adjacencies and v-structures. If the posterior probabilities of adjacency features and v-structure features are computed by marginalizing over all possible DAGs, the scale of the posterior probabilities of these two kinds of graphical features is the same, unlike that of the directed edge features. Therefore we propose that the posteriors of both adjacency and v-structure features must be computed if one wants to recover the Markov equivalence class of the true data generating model by computing posteriors of graphical features. For a graph of size $n$, our feature vector consists of $\frac{n(n-1)}{2}$ binary features for adjacencies and $n\binom{n-1}{2}$ binary features for v-structures. For efficient posterior computation, we provide a novel dynamic programming algorithm which computes the posterior probabilities of all $\frac{n(n-1)}{2}$ adjacency and $n\binom{n-1}{2}$ v-structure features in $O(n^3 3^n)$ time.
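The two feature counts can be made concrete by enumerating the candidate features for a small node set. This is an illustrative sketch; the function names and the tuple encoding of a v-structure as (parent, child, parent) are assumptions, not notation from the text.

```python
from itertools import combinations
from math import comb

def adjacency_features(nodes):
    """One binary feature per unordered pair of nodes."""
    return [frozenset(pair) for pair in combinations(nodes, 2)]

def v_structure_features(nodes):
    """One binary feature per child v and unordered pair {a, b} of other nodes."""
    return [(a, v, b)
            for v in nodes
            for a, b in combinations([u for u in nodes if u != v], 2)]

n = 5
nodes = list(range(n))
n_adjacency = len(adjacency_features(nodes))
n_v_structure = len(v_structure_features(nodes))
```

With five variables, the size used in the experiments later in this chapter, this gives 10 adjacency features and 30 v-structure features.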

3.2 Posterior Computation of Graphical Features

To learn the causal structure, we must compute the posteriors of causal graphical features such as directed edges. The posterior probability of a graphical feature can be computed by averaging over all possible DAG structures.

The posterior probability of a certain DAG G given observational data D can be represented as follows:

\[ P(G \mid D) \propto L(G)\,\pi(G), \tag{3.1} \]

where $L(G)$ is the likelihood score of the structure given the observations, and $\pi(G)$ is the prior of this DAG structure. The likelihood score is computed from the underlying model representing the causal relationships. We describe the model used in our experiments in Section 3.5.

In this chapter, we adopt several widely accepted assumptions that simplify our learning task: parameter independence, parameter modularity [GH94], and prior modularity. Parameter independence means that the parameters in our network model are mutually independent. Parameter modularity means that if a node has the same set of parents in two distinct network structures, the probability distribution for the parameters associated with this node is identical in both networks. Prior modularity means that the structural prior can be factorized into the product of local structure priors. If we assume global and local parameter independence, parameter modularity, and prior modularity, then $L(G)$ and $\pi(G)$ can be factorized into the products of local marginal likelihoods and local structural prior scores, respectively.

\[ P(G \mid D) \propto \prod_{i=1}^{n} L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i), \tag{3.2} \]

where $n$ is the number of variables, $V_i$ is the $i$-th node, $Pa_{V_i}$ is the parent set of node $V_i$, and $\pi_i(Pa_i)$ is the structure prior for the local structure which consists of node $i$ and its parent set $Pa_i$. This factorization implies that conditional independencies in the distribution of observable variables are mirrored by a graphical notion of “path blocking” known as d-separation [Pea88], in the sense that whenever two variable sets X and Y are d-separated by Z in the graph G induced by the underlying causal model, X is conditionally independent of Y given Z in the distribution of variables in the underlying causal model.

In stable [PV91] or faithful [SGS93] models, the converse is also true: conditional independence implies d-separation. Informally, in faithful models all probabilistic structure embodied by conditional independencies is precisely characterized by graphical structure; there are no extraneous independencies due to “numerical coincidences” in the observable joint distribution. Faithfulness is typically assumed by causal discovery algorithms since it allows us to draw conclusions about the graph based on the outcome of conditional independence tests. In addition, we assume that the unobserved variables involved in the data generating process are jointly independent. Finally, since we are interested in learning causal, not just statistical, models, we make the causal Markov assumption; that is, each variable $V_i$ in our model is independent of all its non-effects given its direct causes [PV91].

Now the posterior probability for a feature f of interest can be computed by averaging over all the possible DAG structures, which can be expressed as:

\[ P(f \mid D) = \frac{P(f, D)}{P(D)} = \frac{\sum_{G_j \in \mathcal{G}_{f+}} \prod_{i \in G_j} L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i)}{\sum_{G_j \in \mathcal{G}} \prod_{i \in G_j} L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i)} \tag{3.3} \]

where $\mathcal{G}_{f+}$ is the set of DAG structures in which $f$ occurs, $\mathcal{G}$ is the set of all possible DAG structures, the index $i$ ranges over the $n$ nodes of $G_j$, and $\pi_i(Pa_i)$ is a prior over local structures. If we introduce an indicator function into equation (3.3), $P(f, D)$ can be expressed as follows:

\[ P(f, D) = \sum_{G_j \in \mathcal{G}} \prod_{i \in G_j} f_i(Pa_i)\,L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i), \tag{3.4} \]

where $f_i(Pa_i)$ is an indicator function with value 1 or 0. The function $f_i(Pa_i)$ takes the value 1 if $G_j$ has the feature $f$ of interest (directed edge, adjacency, or v-structure), and 0 otherwise. For the directed edge feature $j \to i$, $f_i(Pa_i) = 1$ if $j \in Pa_i$. For the v-structure feature $a \to i \leftarrow b$, $f_i(Pa_i) = 1$ if $a \in Pa_i$, $b \in Pa_i$, $a \notin Pa_b$, and $b \notin Pa_a$. Computing $P(f, D)$ suffices to evaluate (3.3), since setting $f = 1$ in (3.4) yields $P(D)$. Since the number of possible DAGs is super-exponential, $O(n!\,2^{n(n-1)/2})$, the brute-force approach to computing the posteriors of all possible graphical features takes $O(n(n-1)\,n!\,2^{n(n-1)/2})$ time for all $n(n-1)$ directed edge features and $O(n\binom{n-1}{2}\,n!\,2^{n(n-1)/2})$ time for all $n\binom{n-1}{2}$ v-structure features, where $n$ is the number of nodes in the DAG. In the following sections, we explain how to compute $\sum_{G_j \in \mathcal{G}} \prod_{i \in G_j} f_i(Pa_i)\,L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i)$ more efficiently by dynamic programming.
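The brute-force computation described above can be written out directly for very small graphs. This is a minimal sketch, assuming a hypothetical local score `toy_score(i, pa)` that stands in for the product of the local likelihood and local prior (it is a seeded random table, not the scoring model of this chapter). Enumeration is only practical for a handful of nodes.

```python
import random
from itertools import chain, combinations, product

def subsets(items):
    items = sorted(items)
    return chain.from_iterable(combinations(items, r) for r in range(len(items) + 1))

def is_acyclic(parents):
    """parents[i] is the parent tuple of node i; peel off nodes with no remaining parent."""
    remaining = set(range(len(parents)))
    while remaining:
        free = {i for i in remaining if not remaining.intersection(parents[i])}
        if not free:
            return False
        remaining -= free
    return True

def all_dags(n):
    """Every DAG on n nodes, encoded as one parent tuple per node."""
    choices = [list(subsets(set(range(n)) - {i})) for i in range(n)]
    return [pa for pa in product(*choices) if is_acyclic(pa)]

def posterior(dags, local_score, feature):
    """P(f | D) as in Eq. (3.3): weight of DAGs with the feature over total weight."""
    num = den = 0.0
    for parents in dags:
        weight = 1.0
        for i, pa in enumerate(parents):
            weight *= local_score(i, pa)
        den += weight
        if feature(parents):
            num += weight
    return num / den

_rng, _table = random.Random(0), {}

def toy_score(i, pa):
    """Hypothetical positive local score, cached so repeated calls agree."""
    if (i, pa) not in _table:
        _table[(i, pa)] = _rng.uniform(0.5, 2.0)
    return _table[(i, pa)]

dags = all_dags(4)
p_edge = posterior(dags, toy_score, lambda pa: 0 in pa[1])               # 0 -> 1
p_rev = posterior(dags, toy_score, lambda pa: 1 in pa[0])                # 1 -> 0
p_adj = posterior(dags, toy_score, lambda pa: 0 in pa[1] or 1 in pa[0])  # 0 - 1
p_vstruct = posterior(                                                   # 0 -> 2 <- 1
    dags, toy_score,
    lambda pa: 0 in pa[2] and 1 in pa[2] and 0 not in pa[1] and 1 not in pa[0])
```

Because a DAG cannot contain both orientations of an edge, the adjacency posterior in this sketch equals the sum of the two directed-edge posteriors.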

For simplicity, we let $B_i(Pa_i) = f_i(Pa_i)\,L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i)$.

3.3 Previously Proposed Method

3.3.0.1 Posterior Computation of Directed Edge Features by Exploiting Root Nodes and Sink Nodes

In this section, we describe how to compute $P(f, D) = \sum_{G_j \in \mathcal{G}} \prod_{i \in G_j} B_i(Pa_i)$ for all directed edge features using the DP algorithm introduced in [TH09]. To sum $\prod_{i \in G_j} B_i(Pa_i)$ over all possible $G_j$ by dynamic programming, we use the set inclusion-exclusion principle for computing unions of overlapping sets. We first split the DAGs into sets whose elements have common root nodes or common sink nodes. $P(f, D)$ can then be computed from the union of those sets. We first describe the computation of $P(f, D)$ by exploiting sets with common root nodes. Let $V$ be the set of all variables of interest and let $\zeta^{+}(S)$ be the set of DAGs over $V$ such that all variables in $V - S$ are root nodes. Then for any $S \subseteq V$, we define $RR(S)$ as follows:

\[ RR(S) = \sum_{G \in \zeta^{+}(S)} \prod_{i \in S} B_i(Pa_i). \tag{3.5} \]

If we apply the set inclusion-exclusion principle to $RR(S)$, we can derive a recursive formula for dynamic programming in terms of $RR(S)$ as follows.

First, we define the Ai(S) as follows:

\[ A_i(S) = \sum_{Pa_i \subseteq S} B_i(Pa_i) = \sum_{Pa_i \subseteq S} f_i(Pa_i)\,L(V_i \mid Pa_{V_i})\,\pi_i(Pa_i). \tag{3.6} \]

Then Equation (3.5) can be rewritten as follows:

\[ RR(S) = \sum_{k=1}^{|S|} (-1)^{k+1} \sum_{T \subseteq S,\, |T| = k} RR(S - T) \prod_{j \in T} A_j(V - S). \tag{3.7} \]

Using Equation (3.7)¹, $RR(V)$ can be computed recursively by dynamic programming with base cases $RR(\emptyset) = 1$ and $RR(\{j\}) = A_j(V - \{j\})$. Assuming a fixed maximum number of parents $k$, the values $A_j(S)$ for each $j$ can be computed using the truncated Möbius transform algorithm in $O(k2^n)$ time [KS04]. Since $RR(S)$ for all $S \subseteq V$ can be computed in $\sum_{k=0}^{n} \binom{n}{k} 2^k = 3^n$ time, $RR(V) = P(f_{u \to v}, D)$ can be computed in $O(3^n)$ time for a fixed maximum number of parents. Therefore all $n(n-1)$ directed edge features can be computed in $O(n^2 3^n)$ time.

We now describe how to compute $P(f, D)$ from the union of DAG sets with common sink nodes. For all $S \subseteq V$, let $\zeta(S)$ denote the set of all possible DAGs over $S$. Then for any $S \subseteq V$, we define $H(S)$ as follows:

\[ H(S) = \sum_{G \in \zeta(S)} \prod_{i \in S} B_i(Pa_i). \tag{3.8} \]

If we apply the set inclusion-exclusion principle to $H(S)$, we can derive a recursive formula for dynamic programming in terms of $H(S)$ as follows:

\[ H(S) = \sum_{k=1}^{|S|} (-1)^{k+1} \sum_{T \subseteq S,\, |T| = k} H(S - T) \prod_{j \in T} A_j(S - T). \tag{3.9} \]

Equation (3.9)¹ can be computed efficiently by dynamic programming. Each $H(S)$ can be computed in time $\sum_{k=1}^{|S|} \binom{|S|}{k} k = |S| 2^{|S|-1}$. All $H(S)$ for $S \subseteq V$ can be computed in time $\sum_{k=1}^{n} \binom{n}{k} k 2^{k-1} = n3^{n-1}$. In the next section, we explain how to combine $H(S)$ and $RR(S)$ to compute the posteriors of all directed edge features with a DP algorithm that is a factor of $n$ faster than the one using only $RR(S)$.
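The sink-based recursion of Eq. (3.9) can be sketched with node sets encoded as bitmasks. The code below is an illustration under stated assumptions: `B(j, pa)` is a hypothetical local score taking a node index and a parent bitmask, $A_j(S)$ is obtained by direct summation as in Eq. (3.6) (not by the truncated Möbius transform), and no bound on the number of parents is imposed. With $B \equiv 1$, $H(S)$ reduces to the number of DAGs on $|S|$ labeled nodes, which gives a simple sanity check.

```python
def submasks(mask):
    """Yield every subset of the bitmask `mask`, including 0 and `mask` itself."""
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask

def bits(mask):
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

def make_A(n, B):
    """A[j][S]: sum of B(j, Pa) over all parent sets Pa that are subsets of S (j not in S)."""
    full = (1 << n) - 1
    return [{S: sum(B(j, pa) for pa in submasks(S))
             for S in submasks(full & ~(1 << j))}
            for j in range(n)]

def compute_H(n, A):
    """H[S]: sum over all DAGs on node set S of the product of local scores (Eq. 3.9)."""
    H = {0: 1}
    for S in range(1, 1 << n):          # rest < S numerically, so H[rest] is ready
        total = 0
        for T in submasks(S):
            if T == 0:
                continue
            rest = S & ~T
            term = H[rest]
            for j in bits(T):
                term *= A[j][rest]
            # Sign is (-1)^(|T|+1): positive for odd |T|, negative for even |T|.
            total += term if bin(T).count("1") % 2 else -term
        H[S] = total
    return H

def count_dags(n):
    """Number of labeled DAGs on n nodes, obtained by setting every local score to 1."""
    H = compute_H(n, make_A(n, lambda j, pa: 1))
    return H[(1 << n) - 1]
```

The loop over all pairs (S, T) with T a subset of S visits $3^n$ pairs, which is where the $3^n$ factor in the running time comes from.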

Combining H(S) and RR(S) to Compute Posteriors of All Directed Edge Features

To compute the posteriors of all features efficiently, we separate the feature-specific terms of the posterior computation from the terms that are common to all posterior computations. The common terms can be precomputed once and reused for every posterior computation. Let V be the set of all nodes in the DAG. For a fixed node v, the summation over all DAGs can be decomposed into a component for the set U of ancestors of v, a component for the non-ancestors V − U − {v}, and a feature-specific component. The computation of the posterior then amounts to the summation over DAGs over U, which corresponds to H(U); the summation over DAGs over V − U − {v} with U ∪ {v} as root nodes, which corresponds to RR(V − {v} − U); and the feature-specific component, which corresponds to A_v(U).

¹ The proof of Eq. (3.7) and (3.9) can be found in [TH09].

We define $K_v(U)$, which corresponds to $RR(V - \{v\} - U)$, for any $v \in V$ and $U \subseteq V - \{v\}$ as follows:

\[ K_v(U) = \sum_{T \subseteq V - \{v\} - U} (-1)^{|T|}\, RR(V - \{v\} - U - T) \prod_{j \in T} A_j(U). \tag{3.10} \]

The posterior of $u \to v$ can then be computed by summing the product of the three terms described above over all subsets of $V - \{v\}$:

\[ P(f_{u \to v}, D) = \sum_{U \subseteq V - \{v\}} A_v(U)\,H(U)\,K_v(U). \tag{3.11} \]

In Eq. (3.11), $A_v(U)$ is a feature-specific term, while $H(U)$ and $K_v(U)$ are feature-independent terms. To compute the posteriors for all directed edge features, we first compute the feature-independent terms, which include $B_i$, $A_i$, $RR$, $H$, and $K_i$, under the condition $f = 1$. Since $f_i(Pa_i)$ takes different values for each directed edge feature, we need to recompute $A_v(S)$ for $S \subseteq V - \{v\}$. Assuming a fixed maximum in-degree $k$, the computation of the feature-independent terms takes $O(n3^n)$ time. Since each feature-specific term $A_v(U)$ can be computed in $O(k2^n)$ time for all $U \subseteq V - \{v\}$, this takes $O(kn^2 2^n)$ time for all $n(n-1)$ directed edge features. Therefore the total computational complexity is $O(n3^n + kn^2 2^n) = O(n3^n)$ for all $n(n-1)$ directed edge features.
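To make the combination in Eq. (3.11) concrete, the sketch below evaluates it for one directed edge and compares the result with explicit enumeration of all DAGs. It is an illustration under stated assumptions, not the implementation of [TH09]: node sets are bitmasks, `toy_B` is a hypothetical seeded random local score, the $A$ terms are computed by direct summation, no in-degree bound is used, and nothing is cached across features.

```python
import random
from itertools import product

def submasks(mask):
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask

def bits(mask):
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

def sign(mask):
    """(-1) ** |mask|."""
    return -1 if bin(mask).count("1") % 2 else 1

def prod_A(A, T, S):
    out = 1.0
    for j in bits(T):
        out *= A[j][S]
    return out

def tables(n, B):
    """Feature-independent terms with f = 1: A (Eq. 3.6), H (Eq. 3.9), RR (Eq. 3.7)."""
    full = (1 << n) - 1
    A = [{S: sum(B(j, pa) for pa in submasks(S))
          for S in submasks(full & ~(1 << j))} for j in range(n)]
    H, RR = {0: 1.0}, {0: 1.0}
    for S in range(1, full + 1):
        H[S] = sum(-sign(T) * H[S & ~T] * prod_A(A, T, S & ~T)
                   for T in submasks(S) if T)
        RR[S] = sum(-sign(T) * RR[S & ~T] * prod_A(A, T, full & ~S)
                    for T in submasks(S) if T)
    return A, H, RR

def joint_dp(n, B, v, u=None):
    """P(f_{u->v}, D) via Eq. (3.11); with u=None the feature is dropped, giving P(D)."""
    full = (1 << n) - 1
    A, H, RR = tables(n, B)
    others = full & ~(1 << v)
    total = 0.0
    for U in submasks(others):
        W = others & ~U
        K = sum(sign(T) * RR[W & ~T] * prod_A(A, T, U) for T in submasks(W))  # Eq. 3.10
        A_v = sum(B(v, pa) for pa in submasks(U) if u is None or (pa >> u) & 1)
        total += A_v * H[U] * K
    return total

def joint_brute(n, B, v, u=None):
    """Same quantity by enumerating every DAG as one parent bitmask per node."""
    full = (1 << n) - 1
    def acyclic(pa):
        left = full
        while left:
            free = [i for i in bits(left) if not (pa[i] & left)]
            if not free:
                return False
            for i in free:
                left &= ~(1 << i)
        return True
    total = 0.0
    for pa in product(*[list(submasks(full & ~(1 << i))) for i in range(n)]):
        if acyclic(pa) and (u is None or (pa[v] >> u) & 1):
            weight = 1.0
            for i in range(n):
                weight *= B(i, pa[i])
            total += weight
    return total

_rng, _cache = random.Random(1), {}

def toy_B(i, pa):
    """Hypothetical positive local score, cached so repeated calls agree."""
    if (i, pa) not in _cache:
        _cache[(i, pa)] = _rng.uniform(0.5, 2.0)
    return _cache[(i, pa)]

n = 4
dp = joint_dp(n, toy_B, v=1, u=0)
brute = joint_brute(n, toy_B, v=1, u=0)
posterior = dp / joint_dp(n, toy_B, v=1)
```

Only the `A_v` line depends on the feature; everything returned by `tables` could be computed once and shared across all $n(n-1)$ edges, which is the reuse the complexity analysis above relies on.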

3.4 Novel Proposed Method

In this section, we describe a novel dynamic programming algorithm which computes the posterior probabilities of all $\frac{n(n-1)}{2}$ adjacency and $n\binom{n-1}{2}$ v-structure features in $O(n^3 3^n)$ time.

3.4.1 Novel DP Algorithm for Computing Posteriors of All Adjacency Features

The adjacency feature covers both directions of the directed edge feature. Therefore we can compute $P(f_{u \leftrightarrow v}, D)$ by simply adding the joint probabilities for both directions:

\[ P(f_{u \leftrightarrow v}, D) = P(f_{u \leftarrow v}, D) + P(f_{u \to v}, D). \]

Since the computational complexity for all adjacency features is the same as that for all directed edge features, we can compute all $\frac{n(n-1)}{2}$ adjacency features in $O(n3^n)$ time.

3.4.2 Novel DP Algorithm for Computing Posteriors of All V-Structure Features

We now describe a novel DP algorithm to compute the posteriors of all $n\binom{n-1}{2}$ v-structure features. A directed edge feature is represented by a single edge, while a v-structure feature requires two non-adjacent parent nodes and their common child. Therefore the computation of $P(f, D)$ needs to be modified accordingly.

Of the three terms in equation (3.11), the $K_v(U)$ term for v-structure feature computation is the same as for directed edge feature computation, so we need to modify only the $A_v(U)$ and $H(U)$ terms.

For the computation of the posteriors of v-structures, only the indicator function $f_v(Pa_v)$ in the definition of $A_v(U)$ in equation (3.6) needs to be modified. For example, if the feature $f$ represents the v-structure $a \to v \leftarrow b$, then, since a v-structure requires two parents $a$ and $b$ pointing to node $v$, the indicator function $f_v(Pa_v)$ takes the value 1 if $a \in Pa_v$ and $b \in Pa_v$, and 0 otherwise.

$H(U)$ also needs to be modified for the posterior computation of v-structures. $H(U)$ is the summation over all DAGs over the node set $U$ (equation (3.8)). Let the v-structure feature we are testing be $a \to v \leftarrow b$. Since this v-structure requires that the two parents $a$ and $b$ not be adjacent, DAGs with an edge between $a$ and $b$ should be excluded from the computation of $H(U)$. For the $(n-2)$ v-structures with the same two parents $a$ and $b$, there is significant overlap in the posterior computation. If we compute the posterior probabilities of the v-structures with the same two parent nodes at once, we can exploit this overlap to reduce the time and space complexity.

To this end, for all pairs $(a, b)$ and any $S$, where $a \in V$, $b \in V$, and $S \subseteq V$, we define $HV(S, a, b)$ as follows:

\[ HV(S, a, b) = \sum_{k=1}^{|S|} (-1)^{k+1} \sum_{T \subseteq S,\, |T| = k} HV(S - T, a, b) \prod_{j \in T} AT_j(S - T, a, b), \tag{3.12} \]

where $AT_j(S - T, a, b)$ is (1) $A_j(S - T - \{a, b\})$ when $j = a$ or $j = b$, and (2) $A_j(S - T)$ otherwise. Equation (3.12) computes the summation over all DAGs over $S$ except the DAGs which contain $a \to b$ or $a \leftarrow b$.
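The effect of the modified term $AT_j$ in Eq. (3.12) can be checked on small graphs: with every local score set to 1, $HV(S, a, b)$ should count the DAGs on $S$ that have no edge between $a$ and $b$. The sketch below assumes bitmask-encoded node sets and a hypothetical local score `B(j, pa)`; the brute-force counter is included only for comparison.

```python
from itertools import product

def submasks(mask):
    sub = mask
    while True:
        yield sub
        if sub == 0:
            return
        sub = (sub - 1) & mask

def bits(mask):
    return [i for i in range(mask.bit_length()) if (mask >> i) & 1]

def compute_HV(n, B, a, b):
    """HV[S]: sum over DAGs on S without an a-b edge of the product of local scores."""
    ab = (1 << a) | (1 << b)

    def AT(j, S):
        # Nodes a and b may not pick each other as a parent (case 1 of Eq. 3.12).
        allowed = S & ~ab if j in (a, b) else S
        return sum(B(j, pa) for pa in submasks(allowed))

    HV = {0: 1}
    for S in range(1, 1 << n):
        total = 0
        for T in submasks(S):
            if T == 0:
                continue
            rest = S & ~T
            term = HV[rest]
            for j in bits(T):
                term *= AT(j, rest)
            total += term if bin(T).count("1") % 2 else -term
        HV[S] = total
    return HV

def brute_force_HV(n, B, a, b):
    """Same sum over the full node set by explicit enumeration of parent bitmasks."""
    full = (1 << n) - 1

    def acyclic(pa):
        left = full
        while left:
            free = [i for i in bits(left) if not (pa[i] & left)]
            if not free:
                return False
            for i in free:
                left &= ~(1 << i)
        return True

    total = 0
    for pa in product(*[list(submasks(full & ~(1 << i))) for i in range(n)]):
        if (pa[a] >> b) & 1 or (pa[b] >> a) & 1 or not acyclic(pa):
            continue
        weight = 1
        for i in range(n):
            weight *= B(i, pa[i])
        total += weight
    return total

def ones(j, pa):
    return 1
```

For three nodes with the 0 - 1 adjacency forbidden, the remaining two adjacencies can each be absent or oriented either way without creating a cycle, so the count is 9.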

To apply these two modifications to equation (3.11), $H(U)$ is replaced by $HV(U, a, b)$ for the v-structure $a \to v \leftarrow b$. We can now express $P(f, D)$ for the v-structure feature $a \to v \leftarrow b$ as follows:

\[ P(f_{a \to v \leftarrow b}, D) = \sum_{U \subseteq V - \{v\}} A_v(U)\,HV(U, a, b)\,K_v(U), \tag{3.13} \]

where $a$ and $b$ are the two non-adjacent parents in the v-structure feature $f$ that we are testing. We now give the complexity analysis of the computation of $P(f, D)$ for all $n\binom{n-1}{2}$ v-structure features. To fully exploit the overlap in the v-structure posterior computations, for two fixed parents $a, b \in V$ and all $v_i \in V$ with $v_i \neq a$ and $v_i \neq b$, all $(n-2)$ posteriors $P(f_{a \to v_i \leftarrow b}, D)$ need to be computed at once. Once we have computed $HV(S, a, b)$ for all $S \subseteq V$ with two fixed parents $a$ and $b$, we can reuse these values for the computation of the posteriors of all v-structures with parents $a$ and $b$. This precomputation takes $\sum_{k=0}^{|V|} \binom{|V|}{k} = 2^{|V|}$ space. For a single pair $(a, b)$ with $a, b \in U$, $HV(U, a, b)$ can be computed for all $U \subset V$ in $O(n3^{n-1})$ time. Therefore, $HV$ for all $\frac{n(n-1)}{2}$ pairs $(a, b)$ can be computed in $O(n^3 3^{n-1})$ time. The computational complexity of $A_v(U)$ for a v-structure is the same as in the directed edge case, so for $n\binom{n-1}{2}$ v-structure features, the $A_v(U)$ computation can be done in $O(kn^3 3^n)$ time. Lastly, the computational complexity of $K_v(U)$ is the same as in the directed edge case, which is $O(n3^n)$ time. Therefore we can compute $P(f, D)$ for all $n\binom{n-1}{2}$ v-structures in $O(n^3 3^n)$ time.

3.5 Experiments

In this section we empirically compare the effectiveness of three approaches (inference with adjacencies + v-structures, inference with only directed edges, and the maximum likelihood approach) for discovering causal graphical features. In particular, we focus on determining which of the three approaches allows us to choose an appropriate threshold for the existence of features, which is the key step in the discovery procedure. For this comparison, we applied the three approaches to synthetic data sets generated in the following way.

The continuous-valued synthetic data sets were generated under the assumption that the generative causal model contains linear functions while unobserved variables are Gaussian. The samples of each variable vi which has k parents were thus generated using the following equation.

\[ v_i = \frac{\sum_{j=1}^{k} \alpha_j\,Pa_j(v_i) + N(0, 1)}{\sqrt{\sum_{j=1}^{k} \alpha_j^2 + 2\sum_{a} \cdots}}, \tag{3.14} \]

where $Pa_j(v_i)$ is the $j$-th parent of $v_i$, $\alpha_j$ is the causal coefficient of the $j$-th parent on $v_i$, and $N(0, 1)$ is Gaussian noise with mean 0 and standard deviation 1. We assume that these unobserved Gaussian noise variables are independent of each other. After adding all the effects from the parents, we divide this variable by the normalizing denominator shown in equation (3.14).

\[ \pi(G) = \frac{1}{N^{df}}, \tag{3.15} \]

where $N$ is the number of samples and $df$ is the number of degrees of freedom of the given model $G$. This structural prior has the Markov equivalence and modularity properties, which meet our assumptions: our algorithm computes posterior probabilities by summing over all possible DAG structures, and Eq. (3.15) depends only on the number of samples and the degrees of freedom, which amount to the number of edges in $G$.
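One way to see the modularity claim is to split the exponent by node. The short derivation below is a sketch that assumes Eq. (3.15) reads $\pi(G) = 1/N^{df}$ and that $df$ equals the number of edges of $G$, as the text states; the local prior symbol $\pi_i$ follows Eq. (3.2).

```latex
\[
\pi(G) \;=\; N^{-df} \;=\; N^{-\sum_{i=1}^{n} |Pa_i|}
\;=\; \prod_{i=1}^{n} N^{-|Pa_i|},
\qquad\text{so one may take}\qquad
\pi_i(Pa_i) = N^{-|Pa_i|}.
\]
```

Under the same reading, two Markov-equivalent DAGs share their adjacencies and hence their number of edges, so they receive the same prior.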

We now compare the three prediction approaches on the task of identifying graphical features. As mentioned above, our hypothesis is that predicting only directed edge features from limited samples makes it difficult to determine a single appropriate threshold for the existence of graphical features. To verify this hypothesis, we evaluated the approaches using data sets of various sample sizes. We randomly generated 100 causal diagrams with 5 variables. For each causal graphical model, we simulated samples under the linear model with Gaussian noise explained above. Three data sets were generated, containing 50, 100, and 200 samples.
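The simulation setup can be sketched as follows. Several choices here are assumptions made for illustration and are not taken from the text: the edge probability, the range of the causal coefficients, and, because the normalizing denominator of Eq. (3.14) is incomplete in this text, the use of the empirical standard deviation of each variable as a stand-in for that normalization.

```python
import random
from statistics import pstdev

def random_dag(n, edge_prob, rng):
    """parents[i] lists the parents of node i; edges only run from lower to
    higher index, so the graph is acyclic by construction."""
    return [[j for j in range(i) if rng.random() < edge_prob] for i in range(n)]

def simulate(parents, n_samples, rng, coef_range=(0.5, 1.5)):
    """Linear model with independent N(0, 1) noise; returns one list per variable."""
    n = len(parents)
    coef = {(j, i): rng.uniform(*coef_range) for i in range(n) for j in parents[i]}
    columns = []
    for i in range(n):  # index order is a topological order
        raw = [sum(coef[j, i] * columns[j][s] for j in parents[i]) + rng.gauss(0.0, 1.0)
               for s in range(n_samples)]
        scale = pstdev(raw)  # assumption: empirical rescaling to unit variance
        columns.append([x / scale for x in raw])
    return columns

rng = random.Random(42)
dag = random_dag(5, edge_prob=0.4, rng=rng)
data = {size: simulate(dag, size, rng) for size in (50, 100, 200)}
```

Rescaling each variable keeps variances comparable along long causal chains, so a downstream variable does not get a larger variance simply by having more ancestors.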

We computed the posterior probabilities of all possible graphical features using the three approaches. For the directed edge approach, the posterior probabilities of all $n(n-1)$ directed edge features were computed. For the adjacency + v-structure approach, the posterior probabilities of all $\frac{n(n-1)}{2}$ adjacency features and all $n\binom{n-1}{2}$ v-structure features were computed. For the maximum likelihood (ML) approach, we chose the single model with the highest likelihood score among all possible models. If a feature exists in the maximum likelihood model, its posterior is 1; otherwise it is 0.

Figure 3.2: (a) Directed edge approach (b) Adjacency + v-structure approach (c) Maximum likelihood approach

Figure 3.2 shows the posterior probability distributions of positive features (features which exist in the data generating model) and negative features (features which do not exist in the data generating model) for the three approaches when 200 samples are generated. Due to space limitations, we show only the 200-sample case, but we observed similar results in the 50- and 100-sample cases. Figure 3.2 (a) shows the posterior distribution for the directed edge only approach. As can be seen in the upper and lower plots of Figure 3.2 (a), there is no clean separation between the posterior probabilities of positive and negative edge features, which makes it difficult to choose a threshold. Figure 3.2 (b) shows the posterior distribution for the adjacency + v-structure approach. As Figure 3.2 (b) shows, the posteriors of positive features are mostly greater than 0.9 and the posteriors of negative features are mostly less than 0.5. The adjacency + v-structure approach clearly has better distinguishing power than the other approaches.

3.6 Conclusion

In this chapter we propose a more effective way of identifying the Markov equivalence class of the data generating model from observational data by computing posteriors of graphical features. The previous approach of computing posteriors of only directed edge features makes it difficult to choose a single appropriate threshold because of the scale difference between directed edges forming v-structures and directed edges not forming v-structures. We claim that computing posteriors of both adjacencies and v-structures is necessary and more effective for discovering graphical features, since it allows us to find a single appropriate decision threshold for the existence of the features that we are testing. Empirical validation supports the claim that the adjacency + v-structure approach is more effective than the traditional directed edge only approach. For efficient computation, we provide a novel dynamic programming algorithm, based on the DP suggested in [TH09], which computes all $\frac{n(n-1)}{2}$ adjacency and all $n\binom{n-1}{2}$ v-structure features in $O(n^3 3^n)$ time.

Reference to published article

Eun Yong Kang, Ilya Shpitser and Eleazar Eskin, “Respecting Markov Equivalence in Computing Posterior Probabilities of Causal Graphical Features”, in Proceedings of the 24th AAAI Conference on Artificial Intelligence (AAAI2010), Atlanta, GA: July 11-15, 2010.

CHAPTER 4

Increasing Association Mapping Power and Resolution in Mouse Genetic Studies through the Use of Meta-analysis for Structured Populations

4.1 INTRODUCTION

Model organisms have long played a pivotal role in the research of human diseases. The use of mouse models in particular has been extremely effective for the identification of genes underlying Mendelian disorders. The traditional mode of discovery used to identify loci underlying these disorders has been the F2 cross. In an F2 cross, two inbred mice are used to produce F1 progeny, and these progeny are then crossed to obtain F2 mice, each of which has a genetic structure that is a mix of the two original inbred strains. By applying linkage analysis to F2 populations, regions harboring causal variants are identified with high statistical power. Unfortunately, these approaches have had limited success in identifying genetic variations underlying complex, polygenic traits due to the low resolution of the study [FM01, BFO10a], meaning that the regions found to harbor causal variants are very large.

As an alternative to F2 mapping, a number of groups have proposed the use of GWAS methodologies to map traits in inbred populations [PMB04, Pay07]. Such approaches result in increased resolution, as inbred strains have a more diverse genetic structure, in which only small portions of the genome are shared between any two strains. The initial results were promising, but it was later found that the significant population structure within inbred strains causes a large number of spurious associations and inflates the significance of true associations. Upon correction for population structure, most of the associations identified as significant were found to be spurious [KZW08, MGP09]. Also, when corrected for population structure, existing panels of classical inbred strains were underpowered to detect genetic variants explaining less than 10% of the phenotypic variation. In order to address these issues, Bennett et al. (2010) [BFO10a] utilized a panel of mice called the Hybrid Mouse Diversity Panel (HMDP), which combines inbred strains with recombinant inbred (RI) strains, which resemble an inbred version of an F2 cross. The idea is that inbred strains provide high resolution, while RI strains provide increased power. They showed that association mapping within this panel achieves higher resolution than mapping using only RI strains and higher power than mapping using only inbred strains. However, the power to detect small effects remains quite low, a problem that is due to an inherent limitation in the design of the HMDP: the limit on the availability of inbred strains.

Limited power and resolution are noted problems in many mapping panels, and in order to combat these issues, a number of groups have suggested methods to combine the results from multiple studies [HDK00, HMC02, PBL07, LLW05]. The core concepts behind these methods, all of which are founded on linkage-based methodologies, may be adapted to work in association analysis. However, such approaches may not be well suited to studies in structured populations. For example, a shared feature of these linkage-based methods is the attribution of equal informativeness to each study. Such an assumption may not hold in studies with population structure, as the informativeness of a given panel will be locus-dependent. In this case, methods attributing equal weight to each population may result in sub-optimal power.

In this chapter, we propose a method to combine studies in a locus-specific manner, weighting each study relative to its level of informativeness, and show that our method achieves optimal power within the proposed framework. Our method is based on the concept of meta-analysis. In a meta-analysis, the statistics obtained for each SNP in two separate studies are used to obtain a meta-statistic that combines information from these studies. The most common methods for performing meta-analysis are based on the fixed effect weighted sum of Z-scores (WSoZ) [BFJ08], in which Z-scores from each study are combined using a pre-defined weighting scheme. Typically, weights are set as proportional to the number of individuals in the study. Using this basic idea, we propose a meta-analysis method for combining the results obtained from mapping in the HMDP with those obtained from mapping within an F2 population. Since the best way to combine results from these two populations at a given SNP depends on the strain distribution pattern in each population at that SNP, current meta-analysis methodologies are not well suited. We introduce a method that accounts for the genetic structure within each population when combining results. Using a mixed-model-based approach to correct for population structure, we derive a meta-statistic based on the WSoZ. By applying an optimal weighting scheme, our method achieves both higher power and increased resolution over mapping performed only within one population. We note that the HMDP is only one of several recently proposed strategies for increasing the resolution of mouse genetic studies over traditional crosses. Other strategies include the collaborative cross [AVF11] and the use of heterogeneous stocks [HSV09]. The meta-analysis method we introduce is flexible and may be used to combine studies conducted within these panels as well.
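The generic weighted sum of Z-scores mentioned above can be sketched in a few lines. This illustrates only the standard fixed-weight combination, not the locus-specific weighting developed in this chapter; the function names are hypothetical, and the weights and Z-scores in the usage lines are made-up values (square roots of study sizes are used as one common convention, which is an assumption here).

```python
from math import erfc, sqrt

def weighted_sum_of_z(z_scores, weights):
    """Combine per-study Z-scores; dividing by the root of the summed squared
    weights keeps the result standard normal under the null hypothesis."""
    numerator = sum(w * z for w, z in zip(weights, z_scores))
    return numerator / sqrt(sum(w * w for w in weights))

def two_sided_p(z):
    """Two-sided p-value of a standard normal statistic."""
    return erfc(abs(z) / sqrt(2.0))

# Hypothetical example: two studies of different size at one SNP.
study_sizes = [800, 200]
weights = [sqrt(size) for size in study_sizes]
z_meta = weighted_sum_of_z([2.5, 1.0], weights)
p_meta = two_sided_p(z_meta)
```

A fixed weighting of this kind treats each study as equally informative at every locus up to its size, which is the limitation the locus-specific weighting is meant to address.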

We evaluate our method through simulation and by applying it to real phenotype data for which previous discoveries have been made. First, we evaluate both power and resolution through a simulation framework. We find under many different settings that the meta-analysis approach results in higher power than either single panel. We also find that when applying the meta-analysis approach, resolution is increased 1.5-fold with respect to the HMDP and 3.5-fold with respect to an F2 panel. Next, we apply the meta-analysis approach to map bone mineral density (BMD), which was measured from the femurs of 865 HMDP mice and 161 F2 mice, a cross between C57BL/6 and C3H [FBO11]. In our results, two previously implicated loci are recovered with increased significance. Deriving support intervals using the span of significant SNPs, we find that our method results in increased resolution over results obtained through linkage analysis. Finally, we apply our method to map HDL cholesterol in 687 HMDP mice and 164 F2 mice [NGW09] and find that a gene (Apoa2 [WHQ93]) known to be associated is identified with increased significance.

4.2 MATERIALS AND METHODS

4.2.1 Association Studies

Let us assume that we have measured a phenotype within a population $i$ that contains $n_i$ individuals. We denote the $n_i \times 1$ column vector of phenotype measurements as $y_i = [y_{i1}\ y_{i2}\ \ldots\ y_{in_i}]'$. In order to test the association between the phenotype and a given SNP $r$, we test the hypothesis $\beta = 0$ under the model in equation (4.1), where $\mu$ is the global phenotype mean and $x_i$ is a vector of minor allele counts of SNP $r$ for individuals in population $i$.

$$y_i = \mu + x_i\beta + \epsilon \qquad (4.1)$$

A test statistic for testing $\beta = 0$ is derived by noting the distribution of the estimate of $\beta$ under the assumption of normality. We denote the estimate of $\beta$ in population $i$ as $\hat{\beta}_i$, where $\hat{\beta}_i \sim N(\beta, s_i^2)$ and $s_i^2$ denotes the squared standard error of the estimate in population $i$. The z-score statistic for SNP $r$ in population $i$, $Z_i$, is given in equation (4.2) and may be used to test the hypothesis $\beta = 0$ or to derive other statistics, such as a chi-square or F-statistic.

$$Z_i = \frac{\hat{\beta}_i}{s_i} \qquad (4.2)$$
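As a concrete illustration of equations (4.1) and (4.2), the following minimal sketch computes $\hat{\beta}_i$, $s_i$ and $Z_i$ by ordinary least squares for a single SNP. It assumes independent errors with a common variance estimated from the residuals; the function name `ols_z_score` and the toy data in the usage note are hypothetical and are not taken from the original analysis.

```python
import numpy as np

def ols_z_score(y, x):
    """Return (beta_hat, s, Z) for the model y = mu + x*beta + eps (eq. 4.1)."""
    y = np.asarray(y, dtype=float)
    x = np.asarray(x, dtype=float)
    n = y.size
    X = np.column_stack([np.ones(n), x])   # columns: global mean (mu), SNP (x)
    XtX_inv = np.linalg.inv(X.T @ X)
    coef = XtX_inv @ X.T @ y               # least-squares estimates [mu, beta]
    resid = y - X @ coef
    sigma2 = resid @ resid / (n - 2)       # residual variance estimate
    s = np.sqrt(sigma2 * XtX_inv[1, 1])    # standard error of beta_hat
    return coef[1], s, coef[1] / s         # Z = beta_hat / s (eq. 4.2)
```

For example, `ols_z_score([0.1, 0.9, 2.2, -0.1, 1.1, 1.8], [0, 1, 2, 0, 1, 2])` returns the slope estimate, its standard error and their ratio for six hypothetical individuals with minor allele counts 0, 1 and 2.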

Traditional Meta-Analysis

Most traditional methods for meta-analysis employ the weighted sum of z-scores (WSoZ) [SRC09, WSL09, ZSS08]. In this method, a meta-statistic for each SNP is calculated using equation (4.3), where $w_i$ denotes the weight given to the z-score for population $i$.

$$Z_{meta} = \frac{\sum_i w_i Z_i}{\sqrt{\sum_i w_i^2}} \qquad (4.3)$$

The weights, $w_i$, are often a function of the sample size of their respective population, so that larger population samples obtain a higher weight [BFJ08]. This weighting scheme makes sense intuitively, as we may want to attribute greater confidence to studies with more individuals. Alternatively, weights are set as the inverse of the standard error of the estimate of the beta coefficient, so that $w_i = 1/s_i$. The resulting meta-statistic is the so-called pooled inverse variance-weighted beta coefficient [BFJ08]. As has been done for case-control studies [ZE10], it is possible to show that this particular weighting scheme is optimal in the sense that these weights maximize the power of detecting an effect of size $\beta$.

Given the distribution of $\hat{\beta}_i$, we have that when $\beta \neq 0$, $Z_{meta} \sim N(\lambda, 1)$, where $\lambda$ is a non-centrality parameter with $\lambda = \sum_i w_i \frac{\beta}{s_i} / \sqrt{\sum_i w_i^2}$. $\lambda$ is maximized when $w_i = 1/s_i$, so the meta-statistic has optimal power to detect an effect of size $\beta$. The optimality of the weights $w_i = 1/s_i$ follows from the Cauchy-Schwarz inequality, $\sum_i w_i \frac{\beta}{s_i} \le \sqrt{\sum_i w_i^2}\,\sqrt{\sum_i (\beta/s_i)^2}$. Under the assumption that $\beta$ is the same across all populations, equality holds when $w_i = 1/s_i$.
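The effect of the weight choice on the non-centrality parameter can be checked numerically. The sketch below implements the weighted sum of z-scores from equation (4.3) and compares $\lambda$ under inverse-standard-error weights and under equal weights. The standard errors and effect size are made-up illustrative values, and the function names `wsoz` and `noncentrality` are hypothetical.

```python
import numpy as np

def wsoz(z, w):
    """Weighted sum of z-scores, eq. (4.3)."""
    z = np.asarray(z, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * z) / np.sqrt(np.sum(w ** 2))

def noncentrality(beta, s, w):
    """Mean of Z_meta when beta != 0: sum(w_i * beta / s_i) / sqrt(sum(w_i^2))."""
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    return np.sum(w * beta / s) / np.sqrt(np.sum(w ** 2))

# Hypothetical standard errors for three studies (illustrative values only).
s = np.array([0.10, 0.25, 0.40])
beta = 0.2
lam_opt = noncentrality(beta, s, 1.0 / s)        # inverse-standard-error weights
lam_equal = noncentrality(beta, s, np.ones(3))   # equal weights
```

With $w_i = 1/s_i$ the non-centrality parameter reduces to $\beta\sqrt{\sum_i 1/s_i^2}$, which is the Cauchy-Schwarz upper bound, so `lam_opt` is never smaller than `lam_equal`.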

4.2.2 Association Studies in Structured Populations

Although the traditional approach to association mapping is often used, there are a number of issues that arise when performing this basic analysis. One problem is that of population structure or cryptic relatedness [DRB01, VP05], in which genetic similarities between individuals both inhibit the ability to find true associations and cause the appearance of a large number of false or spurious associations. Mixed effects models are often used in order to correct this problem [Lan02, YPB06, KZW08, KSS10]. Methods employing a mixed effects correction account for the genetic similarity between individuals with the introduction of a random variable into the traditional model from equation (4.1).

$$y_i = \mu + \beta x_i + u_i + \epsilon \qquad (4.4)$$

In the model in equation (4.4), the random variable $u_i$ represents the vector of genetic contributions to the phenotype for individuals in population $i$. This random variable is assumed to follow a normal distribution with $u_i \sim N(0, \sigma_g^2 K_i)$, where $K_i$ is the $n_i \times n_i$ kinship coefficient matrix for population $i$. With this assumption, the total variance of $y_i$ is given by $\Sigma_i = \sigma_g^2 K_i + \sigma_e^2 I$. A z-score statistic is derived for the test $\beta = 0$ by noting the distribution of the estimate $\hat{\beta}_i$. In order to avoid complicated notation, we introduce a more basic matrix form of the model in equation (4.4), shown in equation (4.5).

$$y_i = X_i\Gamma + u_i + \epsilon \qquad (4.5)$$

In equation (4.5), $X_i$ is an $n_i \times 2$ matrix encoding the global mean and SNP vectors and $\Gamma$ is a $2 \times 1$ coefficient vector. We note that this form also easily extends to models with multiple covariates. The maximum likelihood estimate for $\Gamma$ in population $i$ is given by $\hat{\Gamma}_i = (X_i'\Sigma_i^{-1}X_i)^{-1}X_i'\Sigma_i^{-1}y_i$, which follows a normal distribution with mean equal to the true $\Gamma$ and variance $(X_i'\Sigma_i^{-1}X_i)^{-1}$. The z-score statistic for testing $\beta = 0$ is then given in equation (4.7), where $R = [0\ 1]$ is a vector used to select the appropriate entry in the vector $\hat{\Gamma}_i$.

$$Z_i = [R(X_i'\Sigma_i^{-1}X_i)^{-1}R']^{-1/2}R\hat{\Gamma}_i \qquad (4.6)$$
$$\phantom{Z_i} = Q_i^{-1/2}R\hat{\Gamma}_i \qquad (4.7)$$

$$Q_i = [R(X_i'\Sigma_i^{-1}X_i)^{-1}R'] \qquad (4.8)$$
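The generalized least squares computation behind equations (4.6) to (4.8) can be sketched as follows. This is only an illustration under the assumption that the variance components $\sigma_g^2$ and $\sigma_e^2$ are already known; in practice they are estimated from the data, a step that is not shown here. The function name `mixed_model_z` is hypothetical.

```python
import numpy as np

def mixed_model_z(y, x, K, sigma_g2, sigma_e2):
    """Return (Z_i, Q_i) from eqs. (4.6)-(4.8), given known variance components."""
    y = np.asarray(y, dtype=float)
    n = y.size
    X = np.column_stack([np.ones(n), np.asarray(x, dtype=float)])
    Sigma = sigma_g2 * np.asarray(K, dtype=float) + sigma_e2 * np.eye(n)
    Sigma_inv_X = np.linalg.solve(Sigma, X)   # Sigma^{-1} X without explicit inverse
    Sigma_inv_y = np.linalg.solve(Sigma, y)   # Sigma^{-1} y
    V = np.linalg.inv(X.T @ Sigma_inv_X)      # (X' Sigma^{-1} X)^{-1}
    gamma_hat = V @ (X.T @ Sigma_inv_y)       # estimate of Gamma = [mu, beta]
    R = np.array([0.0, 1.0])                  # selects the SNP coefficient
    Q = R @ V @ R                             # eq. (4.8)
    return (R @ gamma_hat) / np.sqrt(Q), Q    # eq. (4.7)
```

When $\sigma_g^2 = 0$ the covariance is diagonal and the statistic reduces to the z-score of equation (4.2) with a known error variance, which gives a simple sanity check for the implementation.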

4.2.3 Meta-Analysis in Structured Populations

In order to perform meta-analysis using multiple structured populations, we adopt the weighted sum of z-scores approach shown in equation (4.3), where the z-score for population $i$ is given in equation (4.7). When $\beta \neq 0$, $Z_{meta}$ will have a normal distribution with variance 1 and mean $\sum_i w_i Q_i^{-1/2} R\Gamma / \sqrt{\sum_i w_i^2}$. Again we employ the Cauchy-Schwarz inequality, shown in equation (4.9), to show that the optimal weights are given by $w_i = Q_i^{-1/2}$. This result may also be obtained by noting that $Q_i^{-1/2}$ from equation (4.7) is the mixed-model equivalent of $1/s_i$ from the section on traditional meta-analysis. However, this result is more general, allowing for a more flexible hypothesis testing framework in which any linear combination of the elements of $\Gamma$ may be evaluated.

$$\sum_i w_i Q_i^{-1/2} R\Gamma \le \sqrt{\sum_i w_i^2}\,\sqrt{\sum_i (Q_i^{-1/2} R\Gamma)^2} \qquad (4.9)$$

Substituting the optimal weights, we arrive at the final meta-statistic given in equation (4.10), with its distribution under the alternative hypothesis given in equation (4.11).

$$Z_{meta} = \frac{\sum_i Q_i^{-1}R\hat{\Gamma}_i}{\sqrt{\sum_i Q_i^{-1}}} \qquad (4.10)$$
$$\sim N\left(\beta\sqrt{\sum_i Q_i^{-1}},\ 1\right) \qquad (4.11)$$
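Given the per-population estimates $R\hat{\Gamma}_i$ and variances $Q_i$, the combined statistic of equation (4.10) is a short computation. The sketch below is a hypothetical helper, not the original implementation; it also converts the statistic to a two-sided p-value under the standard normal null, which assumes that $\Sigma_i$ is known or well estimated.

```python
import math
import numpy as np

def meta_z(beta_hats, Qs):
    """Eq. (4.10): sum_i Q_i^{-1} beta_hat_i / sqrt(sum_i Q_i^{-1})."""
    beta_hats = np.asarray(beta_hats, dtype=float)
    inv_Q = 1.0 / np.asarray(Qs, dtype=float)
    return float(np.sum(inv_Q * beta_hats) / np.sqrt(np.sum(inv_Q)))

def two_sided_p(z):
    """Two-sided p-value of z under a standard normal null distribution."""
    return math.erfc(abs(z) / math.sqrt(2.0))
```

For two populations that each give $\hat{\beta}_i = 1$ with $Q_i = 0.25$ (so $Z_i = 2$ in each), `meta_z([1.0, 1.0], [0.25, 0.25])` equals $2\sqrt{2}$, the same value the equal-weight form of equation (4.3) gives for two z-scores of 2.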

It should be noted that when $\Sigma_i$ is unknown, it must be estimated from the data. In this case, $Z_{meta}$ may not follow a standard normal distribution under the null, due to the unaccounted uncertainty in the estimation of $\Sigma_i$. However, we are able to sidestep this issue by using a global search technique [KZW08, KSS10] to find an optimal estimate of $\Sigma_i$ for each population.

Simulations

Simulations were performed using a previously designed framework [KKW10b, BFO10a]. For both power and resolution, phenotypes were generated by sampling a phenotype for each strain while assuming the model from equation (4.4). The genetic variance $\sigma_g^2$ was determined for a given genetic background ($g^2$) by using equation (4.12), where $S = I_n - \frac{1}{n}J_n$ ($J_n$ is an $n \times n$ matrix of ones).

$$\sigma_g^2 = \frac{g^2\sigma_e^2(n-1)}{(1-g^2)\,Tr(SK_i)} \qquad (4.12)$$

The power and resolution for each effect size ($\beta$) were determined by first applying the association mapping procedures to each simulated phenotype. Power was calculated as the percentage of associations for the known causal SNP that reached significance. For resolution, association was applied to each SNP on the same chromosome as the causal SNP. The distance between the true causal SNP and the peak association was recorded. If the peak association was more than 15 Mb away from the causal SNP, the value was recorded as 15 Mb. This procedure helps to reduce the mean shift that occurs because of low power within a region.
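Equation (4.12) can be evaluated directly from a kinship matrix. The following sketch is illustrative only: the function name `genetic_variance` is hypothetical, and the identity kinship matrix in the usage note is a toy input rather than a kinship matrix from the HMDP or the F2 cross.

```python
import numpy as np

def genetic_variance(g2, sigma_e2, K):
    """Eq. (4.12): genetic variance giving a background effect g2 for kinship K."""
    K = np.asarray(K, dtype=float)
    n = K.shape[0]
    S = np.eye(n) - np.ones((n, n)) / n   # centering matrix S = I_n - (1/n) J_n
    return g2 * sigma_e2 * (n - 1) / ((1.0 - g2) * np.trace(S @ K))
```

With `K = np.eye(4)`, `g2 = 0.25` and `sigma_e2 = 1.0`, the trace term equals $n - 1$ and the result is $1/3$, so that $\sigma_g^2 / (\sigma_g^2 + \sigma_e^2) = 0.25$ as intended.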

Significance Threshold Estimation

Significance thresholds were estimated for each method using a technique utilized previously [KKW10b, BFO10a]. Ten thousand null phenotypes were generated and association statistics were calculated for each phenotype over all SNPs. We selected the minimum p-value for each phenotype, resulting in a set of 10,000 minimum null p-values. The threshold was chosen by selecting the p-value for which only 5% of the minimum p-values were smaller. This p-value then represents our threshold controlling for 5% FDR. The thresholds for the HMDP, F2 and meta-analysis approaches were $3.715 \times 10^{-6}$, $2.4637 \times 10^{-4}$ and $2.7 \times 10^{-6}$, respectively.
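The threshold selection step described above amounts to taking a low quantile of the per-phenotype minimum p-values. The sketch below assumes the null association p-values have already been computed and stored as a matrix with one row per null phenotype and one column per SNP; the function name `min_p_threshold` and this layout are assumptions made for illustration.

```python
import numpy as np

def min_p_threshold(null_p_values, alpha=0.05):
    """Return the p-value below which a fraction alpha of the minimum null p-values fall.

    null_p_values: array-like of shape (n_null_phenotypes, n_snps).
    """
    min_p = np.min(np.asarray(null_p_values, dtype=float), axis=1)
    return float(np.quantile(min_p, alpha))
```

Here `np.quantile` interpolates linearly between order statistics, which is one of several reasonable conventions; with 10,000 null phenotypes the choice makes little practical difference.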

4.2.4 Mouse Association Data

Genotypes for the F2 cross were obtained from previous studies [ECW04, WYS06, NGW09, FBO11]. The original cross contained 311 mice, but we randomly sampled only 300 for our simulation studies. Each mouse was genotyped at about 1200 markers spread across the genome, and this set of markers was used previously to perform linkage analysis. In order to apply the meta-analysis approach outlined in this chapter, we require that the F2 mice be typed at the same markers as the HMDP. Fortunately, since the parental strains for the F2s are part of the HMDP, additional genotyping is not necessary. Instead, we perform imputation to determine the state of each marker that is typed in the HMDP but is not among the markers typed in the F2 cross. By applying the imputation algorithm described below, we obtained a set of 113,650 SNPs which were polymorphic in both the HMDP and the F2 cross. This compares to the total set of markers available for the HMDP, which is of size 132,285.

We utilize a straightforward approach to imputation by noting the simple structure of the F2 genomes. For any two adjacent typed markers in a given F2 mouse, the state of the intervening markers is determined by the state of the two typed markers. Let two adjacent typed markers be $x_i$ and $x_{i+k}$, so that $x_{i+1}, \ldots, x_{i+k-1}$ are the intervening markers. If both $x_i$ and $x_{i+k}$ are in the same state as parent one, then the markers from $x_{i+1}$ to $x_{i+k-1}$ are set to be the same as the corresponding markers in parent one. Likewise, if both $x_i$ and $x_{i+k}$ share the same state as parent two, the intervening markers are set to those from parent two. If there is a switch in state between the two adjacent markers, this indicates a recombination. In this case, we are not able to determine the state of the intervening markers, and these are labeled as unknown.
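The imputation rule above can be sketched for a single chromosome of one F2 mouse as follows. The function name `impute_between`, the labels "P1" and "P2" for the two parental states, and the use of `None` for unknown markers are hypothetical choices made for this illustration. Markers outside the first and last typed marker are left unknown here, a case the text does not address, and translating a parental label into the parent's actual allele at each marker is omitted.

```python
def impute_between(typed_positions, typed_states, n_markers):
    """Fill untyped markers on one chromosome of one F2 mouse.

    typed_positions: sorted indices (into the dense marker map) of typed markers.
    typed_states: parental-origin label at each typed marker, e.g. "P1" or "P2".
    Returns a list of length n_markers holding a label or None (unknown).
    """
    states = [None] * n_markers
    for pos, state in zip(typed_positions, typed_states):
        states[pos] = state
    flanks = list(zip(typed_positions, typed_states))
    for (a, state_a), (b, state_b) in zip(flanks, flanks[1:]):
        if state_a == state_b:
            # No switch between the flanking markers: copy the shared state.
            for m in range(a + 1, b):
                states[m] = state_a
        # Otherwise a recombination lies between a and b; leave markers unknown.
    return states
```

For instance, with typed markers at indices 0, 3 and 5 carrying states "P1", "P1" and "P2", markers 1 and 2 are filled with "P1" while marker 4 remains unknown because the flanking states differ.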

4.3 RESULTS

4.3.1 Combining the HMDP with an F2 cross increases power

We show that by combining the mapping results obtained in the HMDP with those obtained in an F2 cross through meta-analysis, we achieve higher power than when mapping within only one panel. Simulations are performed with genotypes for 300 F2 mice, which were obtained from a previously generated cross [ECW04, WYS06, NGW09]. The F2s were genotyped at about 1200 markers and imputation was performed (see Methods) to obtain genotypes at all markers typed in the HMDP strains.

Power simulations were performed as described in previous studies [KKW10b, BFO10a]. We randomly selected a set of 10,000 SNPs that are polymorphic in both the F2 cross and the HMDP. For each SNP we generated a phenotype with a 25% genetic background effect and a SNP effect of a given size. The genetic background effect can be thought of as the heritability of the trait. Association between each SNP and its corresponding set of generated phenotypes was tested using EMMA [KZW08] for the F2 and HMDP panels alone. Power for each SNP effect size was calculated as the percentage of tests that resulted in a significant p-value. Significance thresholds for each panel were obtained through a parametric bootstrap procedure.

Figure 4.1 shows the comparison of power between the meta-analysis approach and mapping within the individual panels. In these simulations, we varied both the number of F2 mice and the number of HMDP replicates. Power is reported on the y-axis and the magnitude of the SNP effect is reported on the x-axis. The SNP effect is reported in terms of $\beta$ from equation (4.4), and the actual variance explained for a given value will depend on the SNP as well as the genetic background. Therefore, we determine the variance explained by a given effect size under a given genetic background by taking the average variance explained in the HMDP across all SNPs. The meta-analysis method has higher power than mapping within the single populations in all simulations. As power within each of the single populations increases, so does the power of the meta-analysis method. For a large number of F2 mice and HMDP mice, the power to detect small effects increases dramatically by applying meta-analysis. For example, for a SNP effect accounting for 5% of the phenotypic variance ($\beta = 0.5$), we find that mapping within only the HMDP with 5 replicates results in a power of 50%, while mapping within only the F2 cross results in a power of 17%. When combining the results through meta-analysis, the power increases to 75% (Figure 4.1D).

4.3.2 Meta-analysis leads to an increase in resolution over HMDP and F2 mapping

We evaluate the mapping resolution when using the HMDP, F2 and the meta-analysis approaches through simulation. Resolution was evaluated by calculating the genetic distance between a SNP simulated to be causal and the peak associated SNP, while only considering the region within 15 Mb of the causal SNP. Figure 4.2 compares the distribution of these distances under each mapping method. Simulations were performed assuming a 25% genetic background effect and a SNP effect accounting for 10-15% of the phenotypic variance.

Using one replicate for the HMDP, we find that the mean distance of the peak association to the true causal SNP is 3.17 Mb. This compares with a mean of 7.5 Mb obtained when mapping within the F2 panel. When combining results through the meta-analysis approach, the mean distance is decreased to 2.21 Mb. This is an almost 1.5-fold increase in resolution over the HMDP and an almost 3.5-fold increase in resolution over the F2 panel.

4.3.3 Application to Bone Mineral Density

We obtained a set of bone mineral density (BMD) measurements from the femurs of 865 HMDP mice and 161 male F2 mice. We applied association mapping in each panel separately using EMMA [KZW08], and applied the meta-analysis approach as well. Manhattan plots summarizing these results are shown in Figure 4.3. Two loci (Chr. 4 and Chr. 7) showed an increase in significance relative to the associations in either the F2 or HMDP. The Chr. 7 meta-analysis peak was an order of magnitude more significant ($3 \times 10^{-7}$) than either the HMDP ($3.1 \times 10^{-6}$) or the F2 ($1.6 \times 10^{-3}$) peaks. The original QTL on Chr. 7 (Bmd41) had a 1.5 LOD support interval of 80 Mb (24.9 to 104.9 Mb) [FNG09]. The Chr. 7 meta-analysis SNPs with P-values of ≤1e-6 extended from 17.2 to 25.2 Mb, representing a significant improvement in resolution for Bmd41. The QTL on Chr. 4, previously referred to as Bmd7 [FNG09], was the strongest locus affecting femoral BMD in the F2 ($p = 7.8 \times 10^{-4}$). The peak F2 SNP was moderately significant in the HMDP ($1.3 \times 10^{-3}$) and highly significant in the meta-analysis ($2.8 \times 10^{-6}$). Bmd7 was previously found to have a 1.5 LOD support interval of 11.0 Mb (126.2 to 137.2 Mb). In the meta-analysis, SNPs with P-values of ≤1e-6 spanned 10 Mb (from 129 to 139 Mb).

4.3.4 Application to HDL cholesterol

We obtained a set of HDL measurements for 687 male mice, each a member of the HMDP, and a set of 164 male F2s [NGW09]. We applied association mapping in the HMDP and F2 panels separately using EMMA [KZW08] and then applied our meta-analysis approach. Figure 4.4 shows the results of this experiment. As shown in the original paper introducing the HMDP, the peak association for HDL is found on distal chromosome one, in which a well-known association with the Apoa2 [DLW90] gene exists. The peak association is 25 kb upstream of the start site of the Apoa2 gene with a p-value of $7.06 \times 10^{-8}$, which is significant at the $1 \times 10^{-7}$ level estimated from a parametric bootstrap procedure. The mapping results obtained from the F2 panel (Figure 4.4A) resemble a linkage peak, due to the large amount of linkage disequilibrium within the F2 genomes. The peak association identified in the F2 population is over 2 Mb downstream of the end site for the Apoa2 locus with a p-value of $2.67 \times 10^{-9}$. Figure 4.4B shows the mapping result obtained with the meta-analysis procedure. Using the meta-analysis result, we again obtain the association which is 25 kb from the start site of the gene; however, the p-value is greatly reduced to less than $1 \times 10^{-15}$.

4.4 DISCUSSION

In this chapter, we introduce a study design in which the Hybrid Mouse Diversity Panel (HMDP), an inbred panel, is combined with an F2 cross in order to perform association mapping. We show that by utilizing a meta-analysis approach which accounts for the genetic structure of the populations, both association power and resolution are increased when compared with mapping within either of the individual panels. The reason for increased power can be understood intuitively as, in general, increased sample sizes lead to increases in power. However, an increase in resolution when combining a high resolution panel with a low resolution panel is somewhat counterintuitive. One way to understand why we achieve higher resolution is by considering that by combining panels we are increasing the number of overall unique genomic break points.

Our results have focused on the case when the HMDP panel is combined with one F2 cross. However, by using the methodology we present, any number of panels can be combined. One obvious potential for this is that by adding additional F2 panels, we may increase power much further. A significant amount of cross data exists in publicly accessible databases such as MGI [BBK11]. By utilizing existing cross data, researchers will be able to use our technique to increase the power of their studies without spending money to generate F2s of their own.

Another advantage of our method is that it is general enough to be used to combine the HMDP with other types of study designs such as the Collaborative Cross [AVF11] and heterogeneous stock [HSV09]. However, one potential issue that may arise when combining the HMDP with such panels is that of heterogeneity of effect size. That is, the magnitude of main effects may vary between different mapping panels due to the difference in the overall genetic structure. In this case, our method may be easily extended to utilize approaches which account for such heterogeneity between effects [HE11a]. Heterogeneity between effect sizes is also known to be a problem between sexes within the same population. Therefore, a similar approach may be utilized in order to combine results across sexes within the same mapping panel.


4.5 FIGURES

Reference to published article

Nicholas A. Furlotte*, Eun Yong Kang*, Atila Van Nas, Charles R. Farber, Aldons J. Lusis and Eleazar Eskin, “Increasing association power and resolution in mouse genetic studies through the use of meta-analysis for structured populations” in Genetics, April 2012.

[Figure 4.1: four power-curve panels, each plotting Power (y-axis, 0.0 to 1.0) against SNP Effect (x-axis, 0.0 to 2.0) for the HMDP, F2 and META methods. Panels: (A) 100 HMDP, 100 F2s; (B) 200 HMDP, 200 F2s; (C) 300 HMDP, 300 F2s; (D) 500 HMDP, 300 F2s.]

Figure 4.1: Mapping power is increased by combining populations using meta-analysis. We performed simulations assuming a background genetic effect of 25%. Power was calculated as the percentage of associations detected at a given level of significance for SNPs simulated to be causal. The meta-analysis method is shown to provide increased power at all effect sizes.

[Figure 4.2: histogram titled "Resolution of each method for 25% genetic background and 10-15% SNP effect", plotting Counts (y-axis, 0 to 7000) against Distance from causal SNP in Mb (x-axis, 1 to 15) for the Meta, HMDP and F2 methods.]

Figure 4.2: The combined mapping result has higher resolution than that of the HMDP or F2 mapping. The distribution of distances from the true causal SNP to the most significant association is shown in units of megabases. We considered peak associations which are within 15 Mb of the true causal SNP. As expected, the HMDP has much higher resolution than the F2 cross. The combined result achieves an even higher resolution.

Figure 4.3: Meta-analysis results in increased significance and increased resolution for two loci known to be associated with BMD. Two loci, one on chr 4 and one on chr 7, were previously found to be associated with BMD (Bmd7 and Bmd41 respectively) [FNG09]. After applying meta-analysis, we found that the peak associated SNPs for both of these loci had increased significance with respect to the F2 and HMDP mapping panels. We also found that using significant SNPs to define intervals within these loci resulted in tighter confidence intervals, indicating increased resolution.

[Figure 4.4: two Manhattan-style panels plotting −log10(p-value) (y-axis) against Position on Chr 1 in Mb (x-axis, 0 to 200). Panels: (A) Association for HDL in the HMDP and F2 cross; (B) Meta Association for HDL.]

Figure 4.4: Meta-analysis increases significance of known association. We compare the association mapping results obtained from the HMDP, F2 cross and meta-analysis for HDL cholesterol on chromosome 1. We have previously shown that the Apoa2 gene underlies the chromosome 1 HDL locus [WHQ93].

CHAPTER 5

Fast Pairwise IBD Association Testing in Genome-wide Association Studies

5.1 Background

Identity by descent (IBD) is a fundamental concept in genetics. Two individuals are IBD at a locus if they have identical alleles inherited from a common ancestor. Investigators have put tremendous effort into mapping the IBD segments between purportedly unrelated individuals [PNT07, GLS09, BB10]. Current state-of-the-art methods such as GERMLINE [GLS09] and Beagle [BB10, BB11] can detect even small (several megabases) IBD segments between individuals.

One promising application of IBD mapping is to use discovered IBD segments in association testing [PNT07]. Investigators usually test single nucleotide polymorphisms (SNPs) for association, but SNPs may not “tag” low frequency causal variations well [BYP05]. Imputation also performs poorly on rare variations [MHM07, BB09]. Association testing based on the IBD information, or IBD association testing, can complement standard association testing methods [BB11].

There are two categories of methods in IBD association testing. The first is the pairwise method [PNT07], where one compares the IBD rate of case/case pairs to the background IBD rate to detect excessive IBD between cases. The rationale is that if a rare causal variation arose in a relatively recent ancestor, cases will likely share an IBD segment containing the causal variation. The second is the clustering method [GKL11], where one divides individuals into clusters based on the IBD information and then tests each cluster for association, assuming the cluster “tags” a rare causal variation. In this chapter, we focus on the pairwise method.

The pairwise method has two computational challenges. The first challenge is computational efficiency. In the pairwise method, one uses permutation to approximate p-values because it is difficult to analytically obtain the asymptotic distribution of the statistic. Since the p-value threshold for genome-wide association studies (GWAS) is necessarily low due to multiple testing [BT12], one must perform a large number of permutations, which can be computationally demanding. The second challenge is fine-mapping. After one identifies a significant locus, it is important to pinpoint the most significant peak within the locus in order to follow up candidate genes. Permutation is limited for this purpose because the smallest p-value it can approximate is constrained by the number of permutations, often resulting in many SNPs having the same minimal p-value in the region.

In this chapter, we present a new method, Fast-Pairwise, to overcome the computational challenges of the traditional pairwise method. Fast-Pairwise utilizes importance sampling [Was04] to improve efficiency and enable approximation of extremely small p-values. To devise an importance sampling procedure, we introduce a new statistic that has two properties: it can be conveniently sampled and it can approximate the pairwise method statistic. We show that the new statistic has a close relationship to the pairwise method statistic through the properties of the graph representation of IBD.

Fast-Pairwise is efficient and takes only days to complete a genome-wide scan. To demonstrate its utility in fine-mapping, we apply our method to the type 1 diabetes dataset of the WTCCC [Wel07]. In this dataset, the traditional pairwise method can identify a significant region in chromosome 6 [BT12], but it gives the same minimal p-value for a very wide region (26.7Mb-35.5Mb) including all eight classical human leukocyte antigen (HLA) genes. Among these, Fast-Pairwise pinpoints HLA-DQB1, which is known to cause the disease [TBM87].

[Figure 5.1: a table of IBD information (pairs of individuals that are IBD): 1 (A,B); 2 (A,C); 3 (B,C); 4 (D,F); 5 (G,H), shown next to the corresponding IBD graph over individuals A to H, with vertices marked as case or control.]

Figure 5.1: An example of an IBD graph. An IBD detection method provides the IBD information (table). We then build a graph where vertices are individuals and edges are IBD relationships.

Fast-Pairwise is publicly available at http://genetics.cs.ucla.edu/graphibd.

5.2 Methods

5.2.1 IBD graph

Given N individuals, the IBD information at a genomic locus can be represented as a graph with N vertices (Figure 5.1). An edge exists between a pair of vertices if the individuals are IBD.
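The graph representation can be made concrete with a small adjacency-set structure. The sketch below uses the five IBD pairs listed in Figure 5.1; the function name `build_ibd_graph` and the dictionary-of-sets layout are illustrative assumptions, not the data structure used by Fast-Pairwise.

```python
def build_ibd_graph(individuals, ibd_pairs):
    """Return an undirected IBD graph as a dict mapping individual -> set of neighbours."""
    adjacency = {ind: set() for ind in individuals}
    for a, b in ibd_pairs:
        adjacency[a].add(b)   # an edge means the two individuals are IBD at this locus
        adjacency[b].add(a)
    return adjacency

# IBD pairs at one locus, as listed in the table of Figure 5.1.
individuals = list("ABCDEFGH")
ibd_pairs = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "F"), ("G", "H")]
graph = build_ibd_graph(individuals, ibd_pairs)
```

Individuals with no IBD partner at the locus, such as E in this example, remain in the graph as isolated vertices so that the number of vertices is always N.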

5.2.2 Pairwise methods for IBD association mapping

We refer to a class of IBD association mapping methods as “pairwise methods” if they examine the relative numbers of edges in the IBD graph at each locus. There are three different types of edges: edges that connect two case individuals, edges that connect two control individuals and edges that connect a case and a control individual. Pairwise methods can be performed in two different ways. One way is to compare the number of case/case pairs to control/control pairs. A second way is to compare the number of case/case pairs to non-case/case pairs (the union of control/control pairs and case/control pairs). We will call the first method CC and the second method CN. In this chapter, we mainly focus on CN, consistent with previous studies [BT12, PNT07]. When we simply refer to the “pairwise method”, we mean CN.

Suppose that we have $N^+$ cases and $N^-$ controls ($N^+ + N^- = N$). Let $V^+$ be the set of case vertices and $V^-$ be the set of control vertices. Let $E^{++}$ be the set of all possible case/case vertex pairs, $E^{--}$ be the set of all possible control/control vertex pairs, and $E^{+-}$ be the set of all possible case/control vertex pairs. Let $e_{ij}$ be 1 if there exists an edge between vertices $i$ and $j$ and 0 otherwise. Then, the CC and CN statistics are defined as

$$S_{CC} = \widehat{IBD}_{case/case} - \widehat{IBD}_{control/control} = \sum_{(i,j)\in E^{++}} \frac{e_{ij}}{\binom{N^+}{2}} - \sum_{(i,j)\in E^{--}} \frac{e_{ij}}{\binom{N^-}{2}}$$

$$S_{CN} = \widehat{IBD}_{case/case} - \widehat{IBD}_{non-case/case} = \sum_{(i,j)\in E^{++}} \frac{e_{ij}}{\binom{N^+}{2}} - \sum_{(i,j)\in E^{--}\cup E^{+-}} \frac{e_{ij}}{\binom{N^-}{2} + N^+N^-}$$

The asymptotic distributions of these statistics are difficult to obtain analytically. This is because the statistics are based on edges that depend on each other. For this reason, statistical significance is assessed by permutation. We assume a one-sided test where IBD segments carry variants that are involved in disease [BT12].

[Figure 5.2: two scatter plots, "Null hypothesis" and "Alternative hypothesis", each plotting the $S_{CC}$ statistic (y-axis) against the $S_{CN}$ statistic (x-axis).]

Figure 5.2: High correlation between CC and CN statistics. We simulated 1,000 studies under the alternative hypothesis (see Section 5.3.2). We then permuted phenotypes to simulate the null hypothesis. Spearman $\rho$ is 0.89 and 0.99 under the null and the alternative, respectively.

The relationship between CC and CN is worth noting. Under the condition that the background IBD rates of control/control pairs and non-case/case pairs are equivalent ($IBD_{\text{control/control}} = IBD_{\text{non-case/case}}$), CN will be more powerful than CC because of the additional $N^+ N^-$ pairs it considers. However, we expect the relative ordering of the two statistics to be similar (Figure 5.2), since most of the pairs used by CC and CN are the same. As we show below, we use this similarity as the basis for increasing the computational efficiency of computing $S_{CN}$.
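The two definitions can be made concrete with a small sketch. The function name `pairwise_statistics`, the edge-list representation, and the toy graph below are illustrative assumptions, not the implementation used in this chapter.

```python
from math import comb


def pairwise_statistics(edges, status):
    """Return (S_CC, S_CN) for one locus.

    edges  -- list of (i, j) vertex pairs that share an IBD segment (e_ij = 1)
    status -- list where status[i] is 1 for a case and 0 for a control
    """
    n_case = sum(status)
    n_ctrl = len(status) - n_case
    case_case = ctrl_ctrl = case_ctrl = 0
    for i, j in edges:
        n_cases_in_pair = status[i] + status[j]
        if n_cases_in_pair == 2:
            case_case += 1
        elif n_cases_in_pair == 0:
            ctrl_ctrl += 1
        else:
            case_ctrl += 1
    case_rate = case_case / comb(n_case, 2)
    # CC contrasts case/case pairs with control/control pairs only.
    s_cc = case_rate - ctrl_ctrl / comb(n_ctrl, 2)
    # CN contrasts case/case pairs with all non-case/case pairs.
    s_cn = case_rate - (ctrl_ctrl + case_ctrl) / (comb(n_ctrl, 2) + n_case * n_ctrl)
    return s_cc, s_cn


# Hypothetical toy graph: vertices 0-3 are cases, 4-7 are controls.
toy_edges = [(0, 1), (0, 2), (1, 2), (3, 5), (6, 7)]
toy_status = [1, 1, 1, 1, 0, 0, 0, 0]
s_cc, s_cn = pairwise_statistics(toy_edges, toy_status)
print(s_cc, s_cn)
```

In this toy graph the case/control edge (3, 5) is ignored by CC but enters the second term of CN.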

5.2.3 Permutation test

Permutation is the standard approach for assessing significance in the pairwise method. A single permutation can be thought of as randomly sampling a vector of case/control disease status. Let

$$v = (v_1, \ldots, v_N), \quad \forall v_i \in \{0, 1\}$$

be the vector of disease status of $N$ individuals, where 0 denotes control and 1 denotes case. The test statistic of the pairwise method, $S_{CN}$, is a function of $v$. Let $\hat{v}$ be the case/control status that was originally observed in the data. The standard permutation test is equivalent to sampling new $v$ from all possible permutations of $\hat{v}$ assuming a uniform distribution. Let $B$ be the set of sampled $v$. The estimated p-value is

$$\hat{p} = \frac{1}{|B|} \sum_{v \in B} \delta\left(S_{CN}(v) \ge S_{CN}(\hat{v})\right) \qquad (5.1)$$

where $\delta$ is the indicator function.

The drawback of the permutation test is that it is computationally inefficient. If the true p-value is very small, as is required in genome-wide studies, we need a large number of permutations. For example, to assess a p-value $p$ with standard error $p/10$, one needs approximately $100/p$ samples, because the standard error of the estimate is approximately $\sqrt{p/|B|}$ for small $p$. For the genome-wide threshold of IBD association testing ($6 \times 10^{-6}$ [BT12]), more than 10 million permutations are required.
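As a rough illustration of equation (5.1) and of the sample-size argument, the sketch below estimates a permutation p-value for an arbitrary statistic and computes how many permutations a given relative standard error requires. The function names and the toy statistic are hypothetical and chosen only for this example.

```python
import random
from math import ceil


def permutation_p_value(statistic, observed_status, n_perm, seed=0):
    """Estimate the one-sided p-value of equation (5.1) by uniform permutation."""
    rng = random.Random(seed)
    observed = statistic(observed_status)
    v = list(observed_status)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(v)  # a uniformly random permutation of the case/control labels
        if statistic(v) >= observed:
            hits += 1
    return hits / n_perm


def permutations_needed(p, relative_se=0.1):
    """Permutations needed so that the standard error is about relative_se * p.

    Uses the small-p approximation SE ~ sqrt(p / n), i.e. n ~ 1 / (relative_se^2 * p).
    """
    return ceil(1 / (relative_se ** 2 * p))


# Toy statistic (an assumption for illustration): number of cases in the first two slots.
p_toy = permutation_p_value(lambda v: v[0] + v[1], [1, 1, 0, 0], n_perm=20000)
needed = permutations_needed(6e-6)
print(p_toy, needed)
```

With a relative standard error of 0.1, `permutations_needed` reproduces the $100/p$ rule quoted above.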

5.2.4 Fast-Pairwise

We develop a new method, Fast-Pairwise, that uses importance sampling to speed up the CN method [Was04]. Unlike the standard permutation test, which samples case/control status $v$ from the uniform distribution, we sample $v$ non-uniformly.

Specifically, we aim to sample $v$ from all permutations of $\hat{v}$ such that $S_{CN}(v)$ will be similar to $S_{CN}(\hat{v})$ on average. The intuition is that by intentionally sampling $v$ that gives a large value of $S_{CN}$, we can reduce the variance of the p-value estimate. Thus, our goal is to design a sampling procedure that satisfies

$$E_f(S_{CN}(v)) = S_{CN}(\hat{v}) \qquad (5.2)$$

where the expectation is with respect to $f$, our sampling distribution for $v$. However, designing such a sampling procedure is not straightforward. To this end, we leverage the fact that we can construct a simpler statistic that approximates $S_{CN}$, which we use for the sampling.

5.2.5 IBD-degreetype

In order to apply importance sampling, we must identify a statistic that roughly approximates $S_{CN}$ but can be conveniently used for designing a sampling procedure.

Since we have empirically observed that $S_{CC}$ approximates $S_{CN}$ (Figure 5.2), we want to find a statistic that approximates $S_{CC}$. Our proposed statistic $S_{SUM}$ is related to $S_{CC}$ through a concept that we introduce called the IBD-degreetype, which is simply the degree of each individual in the IBD graph. Obtaining the degrees of vertices is equivalent to splitting all edges and counting how many split edges are adjacent to each vertex (Figure 5.3). We then assign these numbers to the vertices. Given this, we define the IBD-degreetype as conceptually similar to a genotype, where the allele of each individual equals the degree of the corresponding vertex in the IBD graph.

The IBD-degreetypes can be used for statistical testing in IBD association testing. The intuition is that if case/case pairs have an excessive number of IBDs, then case vertices will have higher degrees than control vertices. The test based on the IBD-degreetype compares the average degrees between cases and controls,

$$S_{ID} = \sum_{i \in V^+} \frac{w_i}{N^+} - \sum_{i \in V^-} \frac{w_i}{N^-} \qquad (5.3)$$

where $w_i$ is the IBD-degreetype of individual $i$, or equivalently the degree of vertex $i$ in the graph. We note that this statistic is conceptually similar to the traditional association statistic, which compares the frequency of the genotypes between the cases and controls. Here we instead compute the difference in the IBD-degreetypes between cases and controls (hence the name).

We are interested in the monotonic relationship between statistics within the permutation procedure. Let $v_1$ and $v_2$ be permuted case/control statuses. We define the monotonic increasing relationship (MIR) as follows.

Definition 1. Two statistics $S$ and $T$ are in a monotonic increasing relationship if

$$\forall v_1 \ne v_2, \quad S(v_1) \ge S(v_2) \iff T(v_1) \ge T(v_2).$$

It is clear that if two statistics are in MIR, they will give the same p-value under permutation. It also follows that MIR is transitive (if S and T are in MIR and T and U are in MIR, then S and U are in MIR).

The IBD-degreetype test has the following relationship to the pairwise CC method. Figure 5.3 illustrates this relationship with a toy example.

Lemma 1. In a balanced study design ($N^+ = N^-$), CC and the IBD-degreetype test are in MIR.

[Figure 5.3: toy IBD graph with vertices A to H, shown before and after edge splitting. After splitting, the vertex degrees are A = 2, B = 2, C = 2, D = 1, E = 0, F = 1, G = 1, H = 1. Traditional pairwise statistic (CC): 3/6 − 1/6 = 1/3. IBD-degreetype statistic: 7/4 − 3/4 = 1. The two are equivalent and differ by a constant multiplication factor $N^+ - 1 = 3$.]

Figure 5.3: Equivalence of the pairwise (CC) method and the IBD-degreetype test in a balanced study. Note that in the pairwise CC method, the edge between the (D,F) pair is ignored, whereas it would not be ignored in the CN method.

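Proof. Using $N^+ = N^-$,

$$
\begin{aligned}
S_{ID} &= \sum_{i \in V^+} \frac{w_i}{N^+} - \sum_{i \in V^-} \frac{w_i}{N^-} = \frac{1}{N^+} \left\{ \sum_{i \in V^+} w_i - \sum_{i \in V^-} w_i \right\} \\
&= \frac{1}{N^+} \left\{ \left( 2 \sum_{(i,j) \in E^{++}} e_{ij} + \sum_{(i,j) \in E^{+-}} e_{ij} \right) - \left( 2 \sum_{(i,j) \in E^{--}} e_{ij} + \sum_{(i,j) \in E^{+-}} e_{ij} \right) \right\} \\
&= \frac{2}{N^+} \left( \sum_{(i,j) \in E^{++}} e_{ij} - \sum_{(i,j) \in E^{--}} e_{ij} \right) \\
&= \frac{2}{N^+} \binom{N^+}{2} \left( \frac{\sum_{(i,j) \in E^{++}} e_{ij}}{\binom{N^+}{2}} - \frac{\sum_{(i,j) \in E^{--}} e_{ij}}{\binom{N^-}{2}} \right) \\
&= \frac{2}{N^+} \binom{N^+}{2} S_{CC} = (N^+ - 1) \, S_{CC}
\end{aligned}
$$

Since $S_{ID}$ and $S_{CC}$ differ by a constant positive multiplication factor ($N^+ - 1$), they are in MIR.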
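Lemma 1 can be checked on the toy example of Figure 5.3 with exact arithmetic. The edge list below is an assumption: it is one graph that reproduces the degrees and edge counts reported in the figure, with A to D taken as cases and E to H as controls.

```python
from fractions import Fraction
from math import comb

# Assumed toy graph consistent with the degrees reported in Figure 5.3.
edges = [("A", "B"), ("A", "C"), ("B", "C"), ("D", "F"), ("G", "H")]
cases = {"A", "B", "C", "D"}
controls = {"E", "F", "G", "H"}

# IBD-degreetype: the degree of each vertex in the IBD graph.
degree = {v: 0 for v in cases | controls}
for i, j in edges:
    degree[i] += 1
    degree[j] += 1

n_case, n_ctrl = len(cases), len(controls)
case_case = sum(1 for i, j in edges if i in cases and j in cases)
ctrl_ctrl = sum(1 for i, j in edges if i in controls and j in controls)

s_cc = Fraction(case_case, comb(n_case, 2)) - Fraction(ctrl_ctrl, comb(n_ctrl, 2))
s_id = (Fraction(sum(degree[v] for v in cases), n_case)
        - Fraction(sum(degree[v] for v in controls), n_ctrl))

# In a balanced design the two statistics differ by the factor N+ - 1.
lemma_holds = s_id == (n_case - 1) * s_cc
print(s_cc, s_id, lemma_holds)
```

The values agree with the figure: the CC statistic is 1/3 and the IBD-degreetype statistic is 1, a factor of $N^+ - 1 = 3$ apart.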

We introduce another simple form of test statistic based on the IBD-degreetype. This sum statistic is the sum of the IBD-degreetype alleles, or the degrees of the vertices, in cases:

$$S_{SUM} = \sum_{i \in V^+} w_i$$

Lemma 2. $S_{ID}$ and $S_{SUM}$ are in MIR.

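Proof. Note that

$$\sum_{(i,j) \in E^{++}} e_{ij} + \sum_{(i,j) \in E^{+-}} e_{ij} + \sum_{(i,j) \in E^{--}} e_{ij} = |e|$$

where $|e|$ denotes the total count of edges. In addition, the sum of the IBD-degreetypes over all vertices is equal to $2|e|$ because each edge is counted twice:

$$\sum_{i \in V^+} w_i + \sum_{i \in V^-} w_i = 2|e|$$

Therefore,

$$
\begin{aligned}
S_{ID} &= \sum_{i \in V^+} \frac{w_i}{N^+} - \sum_{i \in V^-} \frac{w_i}{N^-} = \left( \frac{1}{N^+} + \frac{1}{N^-} \right) \sum_{i \in V^+} w_i - \frac{2|e|}{N^-} \\
&= \left( \frac{1}{N^+} + \frac{1}{N^-} \right) S_{SUM} - \frac{2|e|}{N^-}
\end{aligned}
$$

Since $\frac{1}{N^+} + \frac{1}{N^-} > 0$ and $\frac{2|e|}{N^-}$ is a constant, $S_{SUM}$ is a monotonic increasing linear transformation of $S_{ID}$. Thus, they are in MIR.

[Figure 5.4: diagram linking the statistics of the pairwise method. $S_{CN}$ and $S_{CC}$: correlated (approximate). $S_{CC}$ and $S_{ID}$: in MIR in a balanced design (Lemma 1). $S_{ID}$ and $S_{SUM}$: in MIR (Lemma 2).]

Figure 5.4: Relationship between different statistics.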
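The linear identity used in the proof of Lemma 2 holds for any design, balanced or not, and can be verified numerically. The sketch below uses a small random graph and an unbalanced split; the graph size, edge probability, and seed are arbitrary assumptions made for illustration.

```python
import random
from fractions import Fraction

rng = random.Random(1)
n, n_case = 12, 4                 # deliberately unbalanced: 4 cases, 8 controls
n_ctrl = n - n_case

# Random IBD graph (hypothetical): each pair is an edge with probability 0.3.
edges = [(i, j) for i in range(n) for j in range(i + 1, n) if rng.random() < 0.3]
w = [0] * n                       # IBD-degreetypes
for i, j in edges:
    w[i] += 1
    w[j] += 1


def s_sum(cases):
    return sum(w[i] for i in cases)


def s_id(cases):
    ctrl_total = sum(w[i] for i in range(n) if i not in cases)
    return Fraction(s_sum(cases), n_case) - Fraction(ctrl_total, n_ctrl)


slope = Fraction(1, n_case) + Fraction(1, n_ctrl)
intercept = Fraction(2 * len(edges), n_ctrl)

# Check S_ID = slope * S_SUM - intercept for many random case assignments.
checks = []
for _ in range(100):
    cases = set(rng.sample(range(n), n_case))
    checks.append(s_id(cases) == slope * s_sum(cases) - intercept)
all_linear = all(checks)
print(all_linear)
```

Because the slope is positive and the intercept does not depend on the case assignment, ranking permutations by $S_{SUM}$ or by $S_{ID}$ gives the same order.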

5.2.6 Substitution strategy

Here we propose to use $S_{SUM}$ in sampling as a substitute for the pairwise method statistic, $S_{CN}$. The logical ground for this strategy comes from the relationship between $S_{CN}$ and $S_{SUM}$. We have empirically shown that $S_{CN}$ and $S_{CC}$ are highly correlated (Figure 5.2). Since $S_{CC}$ and $S_{ID}$ are in MIR in a balanced study (Lemma 1), we expect that they will be correlated in general even in an unbalanced study, and we show this property through a simulation experiment (Supplementary Figure 1). $S_{ID}$ is in MIR with $S_{SUM}$ (Lemma 2). Thus, $S_{SUM}$ can be an approximation to $S_{CN}$ (Figure 5.4).

Given this relationship between $S_{CN}$ and $S_{SUM}$, our strategy is to sample $v$ such that $S_{SUM}(v)$ will be similar to $S_{SUM}(\hat{v})$ on average. Our new goal can be described as

$$E_f(S_{SUM}(v)) = S_{SUM}(\hat{v}) \qquad (5.4)$$

It turns out that this new goal is much easier to achieve. Note that $S_{SUM}$ is used only for sampling. After the sampling is done, the sampled $v$ are used to approximate the p-value of the CN method.

This substitution approach works because in importance sampling, the sampling distribution need not guarantee optimality (the smallest variance of the p-value estimate). Instead, a distribution reasonably similar to the optimal distribution suffices. It is clear that this strategy will perform best if the balanced study condition is met. However, even if the condition is not met, only the variance of the p-value estimate is affected and not the mean. The p-value estimate will still be unbiased; it only means that we will need a larger number of samples to obtain the same accuracy.

5.2.7 Sampling with replacement

In this section, we devise a sampling procedure satisfying equation (5.4). Such a sampling procedure will be the core part of our importance sampling framework for speeding up the CN method.

Sampling a random $v$ from all permutations of $\hat{v}$ can be thought of as sampling $N^+$ out of $N$ individuals that will be assigned case status, or equivalently, sampling $N^+$ case indices among $1, \ldots, N$. Let $a_1, \ldots, a_{N^+}$ be the sampled case indices. These are the indices in $v$ that will be assigned “1”. Sampling case indices is a without-replacement sampling procedure; we cannot sample the same index twice, so that exactly $N^+$ distinct indices will be sampled at the end ($\forall i \ne j, a_i \ne a_j$). This way, we can restrict the sample space of $v$ to the set of all permutations of $\hat{v}$.

However, the design of a sampling procedure satisfying equation (5.4) is considerably easier if we assume sampling case indices with replacement. That is, we allow the same index to be sampled multiple times. Although this assumption is not valid for our purpose, our strategy is to first devise a sampling approach satisfying equation (5.4) assuming sampling with replacement, and then extend the approach to sampling without replacement.

Suppose that we pick $a_1$ among $1, \ldots, N$ with probability $P(a_1 = k) = g(k)$, where $\sum_{k=1}^{N} g(k) = 1$. Since we assume sampling with replacement, sampling $a_2$ is no different from sampling $a_1$; in fact, for any $1 \le i \le N^+$, $a_i$ is independent and identically distributed (IID) with distribution $g$. Now consider $w_{a_1}$, the IBD-degreetype allele of $a_1$. Let $E_g(w_{a_1})$ be the expected value of $w_{a_1}$ with respect to $g$. Again, since we assume sampling with replacement,

$$E_g(w_{a_1}) = \cdots = E_g(w_{a_{N^+}}).$$

Then, by the definition of $S_{SUM}$, we can easily see that

$$E_g(S_{SUM}) = N^+ E_g(w_{a_1}).$$

Thus, equation (5.4) can be written as

$$N^+ E_g(w_{a_1}) = S_{SUM}(\hat{v})$$

or

$$E_g(w_{a_1}) = \frac{S_{SUM}(\hat{v})}{N^+} \qquad (5.5)$$

where the left hand side is the expected case IBD-degreetype allele in our distribution $g$ and the right hand side is the average case IBD-degreetype allele in the observation $\hat{v}$. This shows that, if the p-value is highly significant (i.e., the right hand side is large), we should pick $a_1$ (and all $a_i$) such that the expected value of the IBD-degreetype allele is large.

Here we propose a new sampling procedure that satisfies condition (5.5). We define the distribution $g$ as follows:

$$P(a_1 = k) = g(k) \propto t_k \quad \text{where} \quad t_k = 1 + \rho w_k$$

and

$$\rho = \frac{N \dfrac{S_{SUM}(\hat{v})}{N^+} - \sum_{k=1}^{N} w_k}{\sum_{k=1}^{N} w_k^2 - \dfrac{S_{SUM}(\hat{v})}{N^+} \sum_{k=1}^{N} w_k} \qquad (5.6)$$

It is easy to show that this sampling procedure meets condition (5.5), since condition (5.5) can be written as

$$E_g(w_{a_1}) = \frac{\sum_{k=1}^{N} t_k w_k}{\sum_{k=1}^{N} t_k} = \frac{\sum_{k=1}^{N} (1 + \rho w_k) w_k}{\sum_{k=1}^{N} (1 + \rho w_k)} = \frac{S_{SUM}(\hat{v})}{N^+}$$

and solving this for $\rho$ gives exactly equation (5.6). If $\rho < 0$, then we set $\rho$ to zero to prevent negative $t_k$. Such a case is not of interest to us at any rate, since we focus on the one-sided test for detecting an excess of case/case IBD. We choose the simplest linear function for $t_k$, which enables us to calculate $\rho$ easily. As a result, for any $\hat{v}$, we can calculate $\rho$ and completely define the distribution $g$. So far, we have successfully defined a sampling procedure satisfying equation (5.4) assuming sampling with replacement.
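Equation (5.6) and the resulting distribution $g$ can be sketched in a few lines. The function names and the degree vector are hypothetical, and the sketch assumes that the denominator of equation (5.6) is nonzero.

```python
def tilt_parameter(w, observed_case_mean):
    """Compute rho from equation (5.6).

    w                  -- IBD-degreetypes of all N individuals
    observed_case_mean -- S_SUM(v_hat) / N+, the average case degree in the data
    Assumes the denominator is nonzero; negative rho is clipped to zero.
    """
    n = len(w)
    sum_w = sum(w)
    sum_w2 = sum(x * x for x in w)
    rho = (n * observed_case_mean - sum_w) / (sum_w2 - observed_case_mean * sum_w)
    return max(rho, 0.0)


def sampling_distribution(w, rho):
    """Return g(k), proportional to t_k = 1 + rho * w_k."""
    t = [1 + rho * x for x in w]
    total = sum(t)
    return [x / total for x in t]


# Hypothetical degrees; the observed cases are the individuals with degrees 5, 3 and 1.
w = [5, 4, 3, 3, 2, 1, 1, 1, 0, 0]
observed_case_mean = (5 + 3 + 1) / 3
rho = tilt_parameter(w, observed_case_mean)
g = sampling_distribution(w, rho)
expected_mean = sum(gk * wk for gk, wk in zip(g, w))
print(rho, expected_mean)
```

Under $g$, the expected degree of a sampled index equals the observed average case degree, which is condition (5.5).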

5.2.8 Sampling without replacement

In this section, we extend the sampling procedure from the previous section to sampling without replacement. Here we propose to heuristically apply the same sampling scheme based on $t_1, \ldots, t_N$ to the without-replacement context. When we pick $a_i$ (the $i$th case index), we pick index $k$ among $\{1, \ldots, N\} \setminus \{a_1, \ldots, a_{i-1}\}$ with probability

$$t_k \Big/ \left( \sum_{j=1}^{N} t_j - \sum_{l=1}^{i-1} t_{a_l} \right).$$

That is, we assume the same sampling probability proportional to $t_k$, but we exclude indices previously picked as cases from consideration.

However, this sampling procedure does not exactly satisfy equation (5.4) in the without-replacement sampling context. The indices with larger IBD-degreetype alleles are likely to be picked as cases earlier in the procedure and removed. Thus, if we use $\rho$ calculated assuming sampling with replacement, the expected case IBD-degreetype allele (the left hand side of (5.5)) will be smaller than what we would obtain in the with-replacement context. To compensate for this difference, we use the following heuristic. In the middle of sampling, we empirically assess the left hand side of (5.4) by examining the currently obtained samples. Then, we dynamically increment $\rho$ until the left hand side of (5.4) is close to the right hand side. This simple heuristic is sufficient because, again, in importance sampling, we only need to approximately satisfy equation (5.4).
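The sequential scheme and the shortfall it produces can be illustrated as follows. This is a sketch under stated assumptions: the degrees and the value of $\rho$ are hypothetical, and the dynamic adjustment of $\rho$ described above is not implemented here because its exact update rule is not specified in the text.

```python
import random


def sample_cases_without_replacement(t, n_case, rng):
    """Pick n_case distinct indices; each pick is proportional to t[k]
    among the indices that have not been picked yet."""
    remaining = list(range(len(t)))
    picked = []
    for _ in range(n_case):
        weights = [t[k] for k in remaining]
        pos = rng.choices(range(len(remaining)), weights=weights)[0]
        picked.append(remaining.pop(pos))
    return picked


# Hypothetical degrees. With rho = 5/3, equation (5.6) gives a
# with-replacement expected case degree of 3.0 for these values.
w = [5, 4, 3, 3, 2, 1, 1, 1, 0, 0]
rho = 5 / 3
t = [1 + rho * x for x in w]
n_case = 3

rng = random.Random(0)
n_samples = 20000
total = 0.0
for _ in range(n_samples):
    cases = sample_cases_without_replacement(t, n_case, rng)
    total += sum(w[k] for k in cases) / n_case
mean_case_degree = total / n_samples
print(mean_case_degree)
```

The empirical mean case degree falls below the with-replacement target of 3.0, which is the gap that the heuristic increase of $\rho$ is meant to close.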

5.2.9 P-value calculation

By using the sampling procedure developed in the previous section, we can obtain many samples $v$ that approximately satisfy equation (5.2). The final step is to use these samples to assess the p-value of the pairwise (CN) method. In importance sampling, we must account for the fact that we used a sampling distribution that is different from the original distribution. Our original distribution is the uniform distribution defined by the standard permutation approach. The sampling distribution is defined by the sampling procedure that we developed in the previous section.

For a given $v$, the probability of sampling $v$ differs in the two distributions as follows. In order to sample a $v$, we sample the case indices $a_1, \ldots, a_{N^+}$. The probability of sampling the case indices in the standard uniform distribution is

$$P_{Uniform}(v) = \frac{1}{N (N - 1) \cdots (N - N^+ + 1)}$$

On the other hand, the probability of sampling the case indices in our sampling procedure described in the previous section is

$$P_{New}(v) = \prod_{k=1}^{N^+} \left\{ t_{a_k} \Big/ \left( \sum_{j=1}^{N} t_j - \sum_{l=1}^{k-1} t_{a_l} \right) \right\}$$

This is because at each step, when we pick the $i$th case index $a_i$, we sample the index with probability proportional to $t_{a_i}$, where the previously picked indices $a_1, \ldots, a_{i-1}$ are excluded from consideration. Using the standard formula of importance sampling, we approximate the p-value as

$$\hat{p} = \frac{1}{|B|} \sum_{v \in B} \delta\left(S_{CN}(v) \ge S_{CN}(\hat{v})\right) \frac{P_{Uniform}(v)}{P_{New}(v)}$$

Note that we use the pairwise CN statistic in this formula. We use $S_{SUM}$ only to facilitate the sampling of $v$, and then use the obtained samples for the CN method at the final step.

We can approximate the variance of $\hat{p}$ with the following formula:

$$\frac{1}{|B|} \left\{ \frac{1}{|B|} \sum_{v \in B} \left( \delta\left(S_{CN}(v) \ge S_{CN}(\hat{v})\right) \frac{P_{Uniform}(v)}{P_{New}(v)} \right)^2 - \hat{p}^2 \right\}$$
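Putting the pieces together, the estimator and its variance can be sketched and checked against exact enumeration on a graph small enough to list every case assignment. Everything here is an illustrative assumption: the toy graph, the fixed choice $\rho = 1$, and the function names. It is not the Java implementation described in Section 5.3.

```python
import math
import random
from itertools import combinations

EDGES = [(0, 1), (0, 2), (1, 2), (3, 5), (6, 7)]   # hypothetical toy IBD graph
N = 8


def s_cn(cases):
    """CN statistic for a set of case indices."""
    n_case = len(cases)
    n_ctrl = N - n_case
    cc = sum(1 for i, j in EDGES if i in cases and j in cases)
    rest = len(EDGES) - cc                          # all non-case/case edges
    return cc / math.comb(n_case, 2) - rest / (math.comb(n_ctrl, 2) + n_case * n_ctrl)


def sample_with_log_prob(t, n_case, rng):
    """Sequentially sample case indices; return them with log P_New."""
    remaining = list(range(len(t)))
    picked, log_p = [], 0.0
    for _ in range(n_case):
        weights = [t[k] for k in remaining]
        pos = rng.choices(range(len(remaining)), weights=weights)[0]
        log_p += math.log(weights[pos] / sum(weights))
        picked.append(remaining.pop(pos))
    return picked, log_p


def importance_sampling_p_value(observed_cases, t, n_samples, seed=0):
    rng = random.Random(seed)
    n_case = len(observed_cases)
    # log P_Uniform = -log(N (N-1) ... (N - N+ + 1))
    log_p_uniform = -sum(math.log(N - i) for i in range(n_case))
    observed = s_cn(set(observed_cases))
    terms = []
    for _ in range(n_samples):
        cases, log_p_new = sample_with_log_prob(t, n_case, rng)
        hit = s_cn(set(cases)) >= observed
        terms.append(math.exp(log_p_uniform - log_p_new) if hit else 0.0)
    p_hat = sum(terms) / n_samples
    variance = (sum(x * x for x in terms) / n_samples - p_hat ** 2) / n_samples
    return p_hat, variance


degree = [sum(1 for e in EDGES if k in e) for k in range(N)]
t = [1 + 1.0 * d for d in degree]                   # rho = 1 chosen arbitrarily
observed_cases = [0, 1, 2, 3]
p_hat, variance = importance_sampling_p_value(observed_cases, t, n_samples=20000)

# Exact p-value by enumerating all case assignments (feasible only for tiny N).
all_sets = list(combinations(range(N), len(observed_cases)))
exact = sum(s_cn(set(c)) >= s_cn(set(observed_cases)) for c in all_sets) / len(all_sets)
print(p_hat, variance, exact)
```

Working with log probabilities avoids underflow in $P_{New}(v)$ when $N^+$ is in the thousands, which is a design choice of this sketch rather than something stated in the text.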

5.2.10 Adjusting for population structure

A simple correction for population structure has been previously proposed [PNT07, BT12] for the pairwise method. In this simple approach, the genomic average is subtracted from each of the two contrasting terms of the statistic. For example, in the CN method, the genomic average of the case/case IBD rate is subtracted from the observed case/case IBD rate and the genomic average of the non-case/case IBD rate is subtracted from the observed non-case/case IBD rate before calculating the statistic. The same approach can be applied to our Fast-Pairwise method.

5.3 Results

5.3.1 Efficiency

To assess the efficiency gain of our Fast-Pairwise method, we use the Wellcome Trust Case Control Consortium (WTCCC) data [Wel07]. We first run Beagle FastIBD to detect IBD between individuals [BB11]. Then we test individual IBD segments for associations. We perform IBD association testing using both the traditional pairwise method based on permutation and our Fast-Pairwise method. We perform 10 million permutations for the traditional pairwise method. For our Fast-Pairwise method, we perform importance sampling with 1,000 samples and 10,000 samples. We implemented both methods in the Java programming language.

Table 5.1: Running time for pairwise IBD association testing for the WTCCC whole genome data

                Traditional Pairwise Method           Fast-Pairwise
                10^7 permutations   Adaptive perm.    10^4 samples   10^3 samples
Running time    35 years            474 days          40 days        4 days

Table 5.1 shows the estimated running time of both methods for analyzing the whole genome data (500,000 SNPs) of a single disease. The time is extrapolated from the estimated time for chromosome 22. Our Fast-Pairwise method is several orders of magnitude faster than the traditional pairwise method. With 1,000 samples, it takes only 4 days for the whole genome, while the traditional method can take 13,000 days, or 35 years, of CPU time.

We can reduce the computation time for the traditional pairwise method by employing an adaptive permutation approach. We can terminate the permutation early if the p-value estimate is clearly non-significant. Given a p-value $p$, we need $100/p$ permutations to obtain a standard error of approximately $p/10$. Suppose that we sample $100/p$ permutations for each p-value with an upper limit of 10 million. When we apply this adaptive approach to the WTCCC type 1 diabetes data, the estimated computation time is 474 days. Thus, Fast-Pairwise is still an order of magnitude faster than the traditional pairwise method with an adaptive permutation approach.
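One possible reading of the adaptive scheme is sketched below. Because $p$ is unknown in advance, this sketch stops once 100 permutations have reached the observed statistic (which happens after roughly $100/p$ permutations) or once the cap is hit. The stopping rule, the function name, and the toy statistic are assumptions for illustration; the text does not spell out the exact rule that was used.

```python
import random


def adaptive_permutation_p_value(statistic, observed_status,
                                 max_perm=10_000_000, target_hits=100, seed=0):
    """Permute until target_hits exceedances are seen or max_perm is reached.

    Returns (estimated p-value, number of permutations actually used).
    """
    rng = random.Random(seed)
    observed = statistic(observed_status)
    v = list(observed_status)
    hits = n_done = 0
    while n_done < max_perm and hits < target_hits:
        rng.shuffle(v)
        n_done += 1
        if statistic(v) >= observed:
            hits += 1
    return hits / n_done, n_done


# Toy statistic (hypothetical): the label in the first slot, so the true p-value is 0.5.
p_est, n_used = adaptive_permutation_p_value(lambda v: v[0], [1, 1, 0, 0],
                                             max_perm=100_000)
# With a small cap the loop stops at the cap instead.
p_capped, n_capped = adaptive_permutation_p_value(lambda v: v[0], [1, 1, 0, 0],
                                                  max_perm=50)
print(p_est, n_used, n_capped)
```

Non-significant loci stop after a few hundred permutations, so only loci with small p-values consume the full budget.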

5.3.2 Accuracy

To assess the accuracy of our importance sampling, we use a simulation framework similar to that of [BT12]. Using the HapMap ENCODE regions [Int05], we run HapGen2 to simulate 10,000 individuals [SMD11]. These individuals define our founder population. Then we simulate the first generation by sampling 100,000 individuals from the founders. Then we simulate the second generation by sampling 100,000 individuals from the first generation. We repeat until we obtain the 25th generation. Finally, we use the 25th generation to simulate a case/control study. Within the ENCODE region, we randomly select 5 causal variants among all rare variants (minor allele frequency < 1%). If a haplotype contains one or more causal variants, it confers risk with relative risk selected from uniform(3,10). We assume a disease prevalence of 0.1. Given this disease model, using the standard formula for the case and control minor allele frequencies [HKE09a], we sample 1,500 cases and 1,500 controls from the 25th generation. The IBD information between a pair of individuals is determined by tracking whether they are descendants of the same founder. We repeat this simulation 100 times for each of the ten ENCODE regions to generate 1,000 sets of case/control studies.

Given these case/control study sets, we assess the p-values of the pairwise (CN) method using both the standard permutation and the importance sampling of Fast-Pairwise. We use 10,000 samples for importance sampling and compare to $10^4$, $10^5$, and $10^6$ permutations. Figure 5.5 shows that the p-values of the two methods track very well within the p-value range that permutation can approximate (down to p-values of $10^{-4}$, $10^{-5}$, and $10^{-6}$, respectively). Within this range, the Pearson correlations $r^2$ of the two log p-values are 0.98, 0.94, and 0.99, respectively. This shows that our importance sampling procedure obtains accurate p-values.

Moreover, Figure 5.5 emphasizes a fundamental difference between the two methods. In permutation, the range of p-values one can obtain is limited by the number of permutations. Given $|B|$ permutations, if none of the permutations exceeds the observed statistic, a conservative approximation of the p-value is $1/|B|$. By contrast, in Fast-Pairwise, the p-value range is not bounded by the number of samples. With a relatively small number of samples (10,000), Fast-Pairwise can obtain accurate p-values comparable to the permutation test for a wide range of p-values. We also performed extra permutations ($> 10^6$) to estimate p-values between $10^{-6}$ and $10^{-8}$. The p-values of the two methods are still consistent within this p-value range (green triangles in Figure 5.5C).

[Figure 5.5: three scatter plots of $\log_{10} P$ by Fast-Pairwise with 10,000 samples (vertical axis) against $\log_{10} P$ by (A) 10,000 permutations, (B) 100,000 permutations, and (C) 1,000,000 permutations (horizontal axis); both axes range from −8 to 0.]

Figure 5.5: Accuracy of importance sampling. In simulations using the HapMap ENCODE region, we assess the p-values of the pairwise (CN) method using both the standard permutation and Fast-Pairwise (importance sampling). The vertical dashed line denotes the lower bound of the p-value that permutation can approximate given the number of permutations. The red dots denote the simulations where none of the permutations exceeds the observed statistic and therefore the lower bound of the p-value is reported by the permutation test. The green triangles in (C) are the p-values for which we performed extra permutations ($> 10^6$).

[Figure 5.6: two panels of $-\log_{10} P$ (vertical axis) against chromosome 6 position in Mb (horizontal axis, 20 to 42). (A) Traditional Pairwise, with HLA-DQB1, a peak width of > 8.8Mb, and $P = 6 \times 10^{-6}$ annotated. (B) Fast-Pairwise, with the top SNP rs241432 annotated. Genes marked along the horizontal axis: HLA-A, HLA-C, HLA-B, HLA-DRB1, HLA-DQA1, HLA-DQB1, TAP2, HLA-DPA1, HLA-DPB1.]

Figure 5.6: IBD association testing results of the WTCCC type 1 diabetes data. (A) The results that we would obtain with the traditional pairwise method. Given 5 million permutations, the smallest p-value is bounded at $2 \times 10^{-7}$. The top peak stretches over a >8Mb region, complicating fine-mapping. (B) Fast-Pairwise results. The top peak is at TAP2, whose closest HLA gene is HLA-DQB1. In both plots, the dashed green horizontal line denotes the genome-wide threshold $6 \times 10^{-6}$.

5.3.3 Application to WTCCC type 1 diabetes data

[BT12] applied the pairwise (CN) method to the WTCCC type 1 diabetes (T1D) data based on the Beagle FastIBD IBD mapping results [BB11]. Using 5 million permutations, they found that the major histocompatibility complex (MHC) region on chromosome 6 is statistically significant given the genome-wide threshold $6 \times 10^{-6}$. Since the MHC association to T1D has long been known [TBM87], this result validated that IBD association testing can detect a true association signal.

The limitation of the permutation approach of [BT12] is that although it is possible to determine whether each test is significant ($p < 6 \times 10^{-6}$) using 5 million permutations, it is not possible to approximate much smaller p-values. Given 5 million permutations, the smallest p-value one can approximate is bounded at $2 \times 10^{-7}$. In the MHC region, due to the strong signal and the long linkage disequilibrium, the location of the top peak of the p-value is important for interpreting and fine-mapping the results. Figure 5.6A shows that the top peak of the p-value is stretched over a wide region (>8Mb) including all eight classical HLA genes. Therefore, it is difficult to interpret which HLA gene is likely to be involved in the association.

We applied our Fast-Pairwise method to the same dataset. Since Fast-Pairwise is the same pairwise method with increased efficiency, we expected to see results similar to those of [BT12]. Indeed, we discovered significant associations within the MHC region. However, the difference is that since our method can approximate very small p-values well beyond the genome-wide threshold, it is possible to localize the statistical signal to a single marker (Figure 5.6B). The top hit is at SNP rs241432 ($p = 7 \times 10^{-45}$) in an intron of the TAP2 gene. Among all major class I and II HLA genes, the closest gene to this peak is HLA-DQB1 (150kb upstream from the peak). It is historically known that the main contributing gene for the MHC association to T1D is HLA-DQB1 [TBM87]. Thus, this result demonstrates that our Fast-Pairwise method can pinpoint the causal gene among many HLA genes within the MHC region, while the traditional pairwise method cannot.

One interesting observation is that the peak association of our IBD association test is on the TAP2 gene, which encodes antigen peptide transporter 2 and has been shown to confer independent risk to T1D when conditioned on the DQ haplotypes [QLM07]. Thus, the peak revealed by our Fast-Pairwise method may imply an added effect of TAP2 in addition to the primary effect of HLA-DQB1, which is in linkage disequilibrium.

5.4 Discussion

We have developed a new efficient method for pairwise IBD association testing called Fast-Pairwise. Fast-Pairwise employs importance sampling and can perform the pairwise method more efficiently than the traditional method based on permutation. Moreover, unlike permutation, Fast-Pairwise can approximate extremely small p-values beyond the genome-wide threshold. Using the WTCCC type 1 diabetes data, we show that Fast-Pairwise can successfully pinpoint a gene known to be associated with the disease within the MHC region.

The true utility of IBD association testing lies in finding novel loci where there are potentially multiple rare variants that cannot be found using single SNP tests [BT12]. An important advantage of IBD association testing is its wide applicability. The analysis can be performed using the same genotype data collected for single SNP tests without incurring additional cost. For this reason, we feel that many investigators will apply our method to search for these additional loci bearing rare causal variants. What has prevented researchers from applying this approach is the lack of an efficient method for IBD association testing, which we provide in this chapter. We expect that our new method will promote the wide use of IBD association testing and facilitate further research on the power and utility of IBD association testing.

Reference to published article

Buhm Han*, Eun Yong Kang*, Soumya Raychaudhuri, Paul I. W. de Bakker and Eleazar Eskin, “Fast Pairwise IBD Association Testing in Genome-wide Association Studies”, in Bioinformatics, 2013.

5.5 Supplementary Section

[Figure 5.7, panel A: pairwise scatter-plot matrix of the CN, CC, ID, and SUM statistics under the balanced study design. Spearman correlations shown in the upper panels: CN/CC 1.00, CN/ID 1.00, CN/SUM 0.99, CC/ID 1.00, CC/SUM 0.99, ID/SUM 0.99.]

Figure 5.7: Supplementary Figure 1. Correlation between different statistics. We simulated 1,000 studies under a disease model assuming a balanced study design (see Section 5.3.2). Each study consists of 1,500 cases and 1,500 controls. (A) We measured the correlation between different statistics under the balanced study design. Then we sub-sampled 750 cases to simulate an unbalanced study design (#case:#control = 1:2). (B) We measured the correlation between statistics under the unbalanced study design. The numbers in the upper panels are the Spearman correlations between statistics.

[Figure 5.8, panel B of Supplementary Figure 1: pairwise scatter-plot matrix of the CN, CC, ID, and SUM statistics under the unbalanced study design. Spearman correlations shown in the upper panels: CN/CC 1.00, CN/ID 0.98, CN/SUM 0.97, CC/ID 0.99, CC/SUM 0.97, ID/SUM 0.97.]

Figure 5.8

CHAPTER 6

Meta-analysis identifies Gene-by-environment Interactions as Demonstrated in a Study of 4,965 Mice

6.1 Background

Identifying environmentally specific genetic effects is a key challenge in understanding the structure of complex traits. In humans, gene-by-environment (GxE) interactions have been widely discussed [GLR10, MMF12, SK08, Tal07, FET12, OBF12, DWH13, PCK13, WKZ12, MXX12, GNS12, WWM12], yet only a few have been replicated. One reason for this discrepancy is the inability to accurately control for environmental conditions in humans as well as the inability to observe the same individuals in multiple distinct environments. Model organisms do not share such difficulties and for this reason can play a crucial role in the identification of gene-by-environment interactions. For example, in many mouse genetic studies the same traits are examined under different environmental conditions. Specifically, knock-out or diet-controlled mice are often utilized in the study of cholesterol levels. The availability of these studies presents a unique opportunity to identify genomic loci involved in gene-by-environment interactions as well as those loci involved in the trait independent of the environment.

In order to utilize genetic studies in model organisms to identify gene-by-environment interactions, one needs to directly compare the effects of genetic variations in studies conducted under different conditions. This practice is complicated for a number of reasons when combining more than two studies. First, environmental conditions are often variable across studies and do not fit the standard univariate model for interactions. For example, in one study, cholesterol may be examined under different diet conditions (e.g., low fat and high fat), and in another study cholesterol may be examined using gene knockouts. In this case, it is not straightforward to analyze these studies in aggregate using a single variable to represent the environmental condition. Applying a multivariate model, one in which the environment is represented using multiple environmental variables, results in increased degrees of freedom and low statistical power. Second, model organisms such as the mouse exhibit a large degree of population structure. Population structure is well known for causing false positives and spurious associations [PPP06, DRW01] in association analysis and can be expected to complicate the ability to combine separate studies.

In this paper, we propose a random-effects based meta-analytic approach to com- bine multiple studies conducted under varying environmental conditions and show that this approach can be used to identify both genomic loci involved in gene-by- environment interactions as well as those loci involved in the trait independent of the environment. By making the connection between gene-by-environment interactions and random effects model meta-analysis, we show that interactions can be interpreted as heterogeneity and detected without requiring uni- or multi-variate models. We also define an approach for correcting population structure in the random effects model meta-analysis, extending the methods developed for fixed effects model meta-analysis [FKV12]. We show that this method enables the analyses of large scale meta-analyses with dozens of heterogeneous studies and leads to dramatic increases in power. We demonstrate that insights regarding gene-by-environment interactions are obtained by examining the differences in effect sizes among studies facilitated by the recently de- veloped m-value statistic [HE12], which allows us to distinguish between studies hav- ing an effect and studies not having an effect at a given locus.

We applied our approach, which we refer to as Meta-GxE, to combine 17 mouse high-density lipoprotein (HDL) studies containing 4,965 distinct animals. To our knowledge, this is the largest mouse genome-wide association study conducted to date. The environmental factors of the 17 studies vary greatly and include various diet conditions, knock-outs, different ages and mutant animals. By applying our method, we have identified 26 significant loci. Consistent with the experience of meta-analysis in human studies, our combined study finds many loci which were not discovered in any of the individual studies. Among the 26, 24 loci have been previously implicated in having an effect on HDL cholesterol or closely related lipid levels in the blood, while 2 loci are novel findings. In addition, our study provides insights into genetic effects at several disease loci and their relationship with environment and sex. For example, we identified 3 loci (Chr10:21399819, Chr19:3319089, ChrX:151384614) where female mice show a more significant effect on HDL phenotypes than male mice. We also identified 7 loci (Chr1:171199523, Chr8:46903188, Chr8:64150094, Chr8:84073148, Chr10:90146088, Chr11:69906552, Chr15:21194226) where male mice show a more significant effect on HDL than female mice. In addition, many of the loci show strong gene-by-environment interactions. Using additional information describing the studies and our predictions of which studies do and do not contain an effect, we gain insights into the interaction. For example, the locus on chromosome 8 (Chr8:84073148) shows a strong sex by mutation-driven LDL level interaction, which affects HDL cholesterol levels.

Part of the reason for our success in identifying a large number of loci is that our study combined multiple mouse genetic studies, many of which use very different mapping strategies. Over the past few years, many new strategies have been proposed beyond the traditional F2 cross [FE12], which include the hybrid mouse diversity panel (HMDP) [BFO10b, GRF12], heterogeneous outbred stocks [VSG06], commercially available outbred mice [YNB10], and the collaborative cross [AVF11]. In our current study, we are combining several HMDP studies with several F2 cross studies and benefit from the statistical power and resolution advantages of this combination [FKV12]. The methodology presented here can serve as a roadmap for both performing and planning large scale meta-analysis combining the advantages of many different mapping strategies.

6.2 Results

6.2.1 Discovering environmentally-specific loci using meta-analysis

The Meta-GxE strategy uses a meta-analytic approach to identify gene-by-environment interactions by combining studies that collect the same phenotype under different conditions. Our method consists of four steps. First, we apply a random effects model meta-analysis (RE) to identify loci associated with a trait considering all of the studies together. The RE method explicitly models the fact that loci may have different effects in different studies due to gene-by-environment interactions. Second, we apply a heterogeneity test to identify loci with significant gene-by-environment interactions. Third, we compute the m-value of each study to identify in which studies a given variant has an effect and in which it does not. Fourth, we visualize the result through a forest plot and PM-plot to understand the underlying nature of gene-by-environment interactions.

We illustrate our methodology by examining a well-known region on mouse chromosome 1 harboring the Apoa2 gene, which is known to be strongly associated with HDL cholesterol [WHQ93]. Figure 6.1 shows the results of applying our method to this locus. We first compute the effect size and its standard deviation for each of the 17 studies. These results are shown as a forest plot in Figure 6.1 (a). Second, we compute the P-value for each individual study, also shown in Figure 6.1 (a). If we were to follow traditional methodology and evaluate each study separately, we would declare an effect present in a study if the P-value falls below a predefined genome-wide significance threshold ($P < 1.0 \times 10^{-6}$). In this case, we would only identify the locus as associated in a single study, HMDP-chow(M) ($P = 6.84 \times 10^{-9}$). On the other hand, in our approach, we combine all studies to compute a single P-value for each locus, taking into account heterogeneity between studies. This approach leads to increased power over the simple approach considering each study separately. The combined meta P-value for the Apoa2 locus is very significant ($4.41 \times 10^{-22}$), which is consistent with the fact that the largest individual study only has 749 animals compared to 4,965 in our combined study.

In order to evaluate how significantly different the effect sizes of the locus are between studies, we apply a heterogeneity test. The statistical test is based on Cochran's Q test [DL86a, Coc09], which is a non-parametric test for testing if studies have the same effect or not. In this locus, the effect sizes are clearly different and, not surprisingly, the P-value of the heterogeneity test is significant ($5.80 \times 10^{-5}$). This provides strong statistical evidence of a gene-by-environment interaction at the locus. Below we more formally describe how heterogeneity in effect size at a given locus can be interpreted as gene-by-environment interaction.

If a variant is significant in the meta-analytic testing procedure, then this implies that the variant has an effect on the phenotype in one or more studies. Examining in which subset of the studies an effect is present and comparing to the environmental conditions of the studies can provide clues to the nature of gene-by-environment interactions at the locus. However, the presence of the effect may not be reflected in the study-specific P-value due to a lack of statistical power. Therefore, it is difficult to distinguish only by a P-value if an effect is absent in a particular study due to a gene-by-environment interaction at the locus or a lack of power. In order to identify which studies have effects, we utilize a statistic called the m-value [HKE09b], which estimates the posterior probability of an effect being present in a study given the observations from all other studies. We visualize the results through a PM-plot, in which P-values are simultaneously visualized with the m-values at each tested locus. These plots allow us to identify in which studies a given variant has an effect and in which it does not. M-values for a given variant have the following interpretation: a study with a small m-value ($\leq 0.1$) is predicted not to be affected by the variant, while a study with a large m-value ($\geq 0.9$) is predicted to be affected by the variant.
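The m-value thresholds above translate directly into a three-way call per study. The following is a minimal sketch, assuming the m-values have already been computed elsewhere; the function names `classify_m_value` and `tally_calls` and the example m-values are hypothetical and are not taken from the Meta-GxE software.

```python
from collections import Counter

def classify_m_value(m, low=0.1, high=0.9):
    """Map an m-value to 'N' (no effect), 'E' (effect) or 'A' (ambiguous)."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("an m-value is a posterior probability in [0, 1]")
    if m <= low:
        return "N"
    if m >= high:
        return "E"
    return "A"

def tally_calls(m_values):
    """Count studies with an effect (E), ambiguous (A) and without an effect (N)."""
    counts = Counter(classify_m_value(m) for m in m_values)
    return {label: counts.get(label, 0) for label in ("E", "A", "N")}

# Hypothetical m-values for five studies at one locus.
print(tally_calls([0.97, 0.92, 0.55, 0.08, 0.30]))
```

The E/A/N tally mirrors the per-locus summary reported for the significant loci later in this chapter.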

The PM-plot for the Apoa2 locus is shown in Figure 6.1 (b). If we only look at the separate study P-values (y-axis), we can conclude that this locus only has an effect in HMDP-chow(M). However, if we look at the m-values (x-axis), then we find 8 studies (HMDPxB-ath(M), HMDPxB-ath(F), HMDP-chow(M), HMDP-fat(M), HMDP-fat(F), BxD-db-5(M), BxH-apoe(M), BxH-apoe(F)) where we predict that the variation has an effect, while in 3 studies (BxD-db-12(F), BxD-db-5(F), BxH-wt(M)) we predict there is no effect. The predictions for the remaining 6 studies are ambiguous.

From Figure 6.1, we observe that differences in effect sizes among the studies are remarkably consistent when considering the environmental factors of each study as described in Table 6.1. For example, when comparing studies 1 to 4, the effect size of the locus decreases in both the male and female HMDPxB studies in the chow diet (chow study) relative to the fat diet (ath study). Thus we can see that when the mice carry the Leiden/CETP transgenes, which cause high total cholesterol and high LDL cholesterol levels, the effect size of this locus on the HDL cholesterol level in blood is affected by the fat level of the diet. Similarly, when comparing studies 12 to 15, the knockout of the Apoe gene affects the effect sizes for both male and female BxH crosses. However, in the BxD cross (studies 8 to 11), where each animal is homozygous for a mutation causing a deficiency of the leptin receptor, the effect of the locus is very strong in the young male animals, while as animals get older and become fatter, the effect becomes weaker. In the case of female mice, however, the effect of the locus is nearly absent at both 5 and 12 weeks of age. Thus we can see that sex plays an important role in affecting HDL when the leptin receptor activity is deficient. We note that there are many genes in this locus and the genetic mechanism of interactions may involve genes other than Apoa2. Despite this caveat, the results of Meta-GxE at this locus provide insights into the nature of GxE and can provide a starting point for further investigation.

We note that an alternate explanation for differences in effect sizes between studies is the presence of gene-by-gene interactions and differences in the genetic backgrounds of the studies. While this is a possible explanation for differences in effect sizes between the different crosses and the HMDP studies, in Figure 6.1 we see many differences in effect sizes among studies with the same genetic background. Thus gene-by-gene interactions can only partially explain the differences in observed effect sizes.

6.2.2 The connection between random effects meta-analysis and gene-by-environment interactions

Gene-by-environment interactions, random effects meta-analysis and heterogeneity testing are closely related. Suppose we have $k$ studies, each conducted under different environmental conditions. We define the following linear model, where $y_i$ is the observed phenotype for study $i$, $\alpha_i$ is the phenotype mean for study $i$, $\delta_i$ is the genetic effect on the phenotype for study $i$, $X$ is the genotype, and $e$ is the residual error.

$$y_i = \alpha_i + \delta_i X + e \tag{6.1}$$

Since each environment is different, the effect size $\delta_i$ is partially determined by environmentally-specific factors and partially determined by factors common to all studies. We can therefore decompose the effect $\delta_i$ into environment-independent and environment-dependent factors, $\delta_i = \beta + \gamma_i$, and define the following linear model, where $\beta$ is the environment-independent genetic effect and $\gamma_i$ is the environment-dependent genetic effect for study $i$.

$$y_i = \alpha_i + \beta X + \gamma_i X + e \tag{6.2}$$

In order to test for the presence of an effect shared across environments, we test the null hypothesis $\beta = 0$, and to test for the presence of a gene-by-environment interaction, we test the null hypothesis that $\gamma_i = 0$ for every study $i$.

In the random effects meta-analysis, we assume that the effect size $\delta_i$ is sampled from a normal distribution with mean $\mu$ and variance $\tau^2$, denoted $\delta_i \sim N(\mu, \tau^2)$. Under this assumption, we test the null hypothesis $\mu = 0$ and $\tau^2 = 0$ in order to obtain a study-wide P-value. Additionally, we perform a heterogeneity test of the null hypothesis $\delta_1 = \dots = \delta_k$ versus the alternative hypothesis NOT ($\delta_1 = \dots = \delta_k$). We posit that by conducting hypothesis tests in the meta-analysis framework, we are simultaneously testing for the presence of environmentally-independent and environmentally-specific effects, and that by applying heterogeneity testing we are testing for only environmentally-specific effects.

Consider that in the meta-analysis framework $\mu$ is analogous to $\beta$ and the variation ($\tau^2$) around $\mu$ is analogous to variation among the $\gamma_i$. In the random effects meta-analysis testing framework we are testing if $\mu = 0$ and $\tau^2 = 0$. This is equivalent to testing both environmentally-independent ($\beta = 0$) and environmentally-dependent ($\gamma_i = 0$) effects simultaneously. In heterogeneity testing, we test the null hypothesis $\delta_1 = \dots = \delta_k$ versus the alternative hypothesis NOT ($\delta_1 = \dots = \delta_k$). When the environmentally-dependent effect ($\gamma_i$) is 0, it means that $\tau^2 = 0$ and thus $\delta_1 = \dots = \delta_k$. When $\tau^2 \neq 0$, we expect that $\delta_i$ will vary around $\mu$, so that we do not expect that $\delta_i = \delta_j$. Since the variation ($\tau^2$) of $\delta_i$ around $\mu$ is analogous to the variable $\gamma_i$, heterogeneity testing in the meta-analysis framework is approximately equivalent to testing for environmentally-specific effects.

6.2.3 Gene-by-environment interactions are prevalent in mouse association studies

The presence of heterogeneity in the effect size at causal genetic loci due to gene-by-environment interactions is naturally expected in mouse genetic studies when combining studies with varying environmental conditions. One extreme example comes from a knock-out experiment. If the knocked-out gene is causal for a particular trait, then we can expect that the gene would have no effect on a knock-out mouse, while the gene would have an effect on the wild type mouse. This is a binary form of heterogeneity. In a less extreme form of heterogeneity, the effect of a given gene may be affected by an environmental factor which varies in different mice, ranging from small effects to large effects.

To see the relationship between the significance of an association and gene-by-environment interactions, we compute the random effects meta-analysis P-value for each SNP across the 17 studies and compare it to a measure of heterogeneity. Heterogeneity can be assessed by the $I^2$ statistic, which describes the percentage of variation across studies that is due to heterogeneity rather than chance [HT02].
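A common way to obtain $I^2$, following Higgins and Thompson, is from Cochran's Q over $k$ studies: $I^2 = \max(0, (Q - (k-1))/Q) \times 100\%$. The sketch below assumes that per-study effect estimates and their variances are already available; the function names and the example numbers are hypothetical.

```python
def cochran_q(effects, variances):
    """Weighted sum of squared deviations from the inverse-variance pooled effect."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    return sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))

def i_squared(effects, variances):
    """Percentage of variation across studies attributed to heterogeneity."""
    q = cochran_q(effects, variances)
    df = len(effects) - 1
    if q <= 0.0:
        return 0.0
    return max(0.0, (q - df) / q) * 100.0

# Hypothetical effect estimates and variances for four studies.
print(i_squared([0.10, 0.35, 0.05, 0.40], [0.01, 0.01, 0.02, 0.02]))
```

Truncating at zero reflects that Q can fall below its degrees of freedom by chance, in which case no excess heterogeneity is indicated.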

Figure 6.2 compares the $I^2$ statistic with the meta-analysis P-value for each SNP. In this figure, we see that $I^2$ is uniformly distributed for the non-significant SNPs (blue dots), while it is right skewed for significant SNPs (red dots), indicating that more significant SNPs have a greater potential for exhibiting heterogeneity in effect. Since heterogeneity in this case can be interpreted as representing gene-by-environment interactions, as heterogeneity is induced by differences in the environment, we see that the presence of a GxE interaction confers higher power to detect an association.

6.2.4 Power of meta-analysis for detecting gene-by-environment interactions

The power to identify both gene-by-environment and main effects in a meta-analysis of mouse studies depends on both the main effect size and the amount of heterogeneity. We performed simulations using the genotypes of the 17 mouse studies analyzed in this paper. We simulated a range of main effect (mean effect) sizes and a range of gene-by-environment effects. We are simulating the realistic scenario in which we do not know exactly the set of covariates which are responsible for the gene-by-environment effects. We simulated gene-by-environment effects by drawing the effect in each study from a distribution with a mean given by the main effect size and a variance controlling the magnitude of gene-by-environment interactions. If this variance is small, then all of the studies have close to the same effect size and there are few gene-by-environment effects. If the variance is high, then there are strong gene-by-environment effects. Figure 6.3 shows the results of our simulations. We generated 1,000 simulated phenotypes for each mean and variance pair. Statistical power is estimated by computing the proportion of the datasets in which a simulated effect is detected. We observe that the power is high for a wide range of main effect sizes and gene-by-environment effect sizes, which is explained by the large sample size of the study. We also observe that even for small main effects, if there are strong gene-by-environment effects, we can still identify the locus. This is because in this case a subset of the studies will have strong effect sizes due to gene-by-environment effects.
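The simulation design can be illustrated with a simplified sketch. Note the assumptions: the study above simulates phenotypes on the real genotypes and applies Meta-GxE, whereas this sketch works directly on summary statistics (per-study effect estimates with known variances) and, for brevity, estimates power only for a fixed effects z-test and Cochran's Q heterogeneity test. All function names, the variances and the parameter grid are hypothetical.

```python
import math
import random

def chi2_sf(x, df):
    """Chi-square upper tail, via the recurrence over degrees of freedom."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        sf, k = math.erfc(math.sqrt(half)), 1
    else:
        sf, k = math.exp(-half), 2
    while k < df:
        sf += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return sf

def fe_p_value(estimates, variances):
    """Two-sided p-value of the inverse-variance weighted (fixed effects) z-test."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, estimates)) / sum(weights)
    z = pooled * math.sqrt(sum(weights))
    return math.erfc(abs(z) / math.sqrt(2.0))

def he_p_value(estimates, variances):
    """P-value of Cochran's Q heterogeneity test."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, estimates)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, estimates))
    return chi2_sf(q, len(estimates) - 1)

def estimate_power(mean, tau2, variances, alpha, n_rep=1000, seed=0):
    """Fraction of replicates in which each test rejects at level alpha."""
    rng = random.Random(seed)
    fe_hits = he_hits = 0
    for _ in range(n_rep):
        # True per-study effects vary around the main effect (GxE magnitude = tau2).
        true = [rng.gauss(mean, math.sqrt(tau2)) for _ in variances]
        # Observed estimates add sampling noise with the study's variance.
        est = [rng.gauss(d, math.sqrt(v)) for d, v in zip(true, variances)]
        fe_hits += fe_p_value(est, variances) < alpha
        he_hits += he_p_value(est, variances) < alpha
    return fe_hits / n_rep, he_hits / n_rep

if __name__ == "__main__":
    study_variances = [0.01] * 5  # hypothetical: five equally sized studies
    for mean in (0.0, 0.1, 0.3):
        for tau2 in (0.0, 0.01, 0.1):
            fe, he = estimate_power(mean, tau2, study_variances, 0.05, n_rep=200)
            print(f"mean={mean:.1f} tau2={tau2:.2f} FE power={fe:.2f} HE power={he:.2f}")
```

Sweeping the mean and variance in this way gives a power surface of the kind described above, although absolute values from this simplified setting are not comparable to those in the study.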

Our approach is not the only way to analyze a meta-analysis study. We compare the power to two other meta-analytic approaches. The first is the traditional meta-analysis strategy which uses a fixed effects model (FE) in which all of the effect sizes across studies are assumed to be the same. We utilize an extension of the fixed effects model which corrects for population structure [FKV12]. A second alternate strategy is to simply apply the heterogeneity test (HE), which in our framework is only applied to loci first identified using random effects meta-analysis. The HE test follows the intuition that loci with high heterogeneity will harbor gene-by-environment interactions. For the purposes of the comparison we refer to Meta-GxE as the random effects (RE) model.

The level of gene-by-environment interactions can be simulated by changing both the environment-dependent and environment-independent effects simultaneously when simulating the phenotype. Figure 8.1 (a) - (c) shows the power of the three approaches (RE, FE, HE), respectively, when we vary the mean and variance of the effect size distribution from which we sample. In this simulation study, the mean effect represents the shared effect and the variance of the effect size represents the interaction effect. As expected, RE has high power in cases where the shared effect or the interaction effect is large. FE has high power when the shared effect is large, and the HE test has high power when the interaction effect is large. Figure 8.1 (d) shows a heatmap in which each cell is colored according to the approach with the highest power. FE is most powerful in the top-left region, HE is most powerful in the bottom-right region, while RE is most powerful for a majority of the simulations. In the Supplementary information, we show through simulations that our methodology outperforms the alternative fixed effects and heterogeneity testing approaches when the effect is present in a subset of the studies, which is another possible interaction model we can assume. We also show in the Supplementary information that our approach is more powerful than the traditional uni- or multi-variate gene-by-environment association approach, which assumes knowledge of the covariates involved in gene-by-environment interactions. For the traditional uni- or multi-variate approach, the required knowledge includes the kinds of variables (e.g. sex, age, gene knockouts) and the encoding of the variables (e.g. binary values, continuous values). In the Supplementary section, we also show that our proposed approach controls the false positive rate.

6.2.5 Application to 17 mouse HDL studies

We applied Meta-GxE to 17 mouse genetic studies conducted under various environmental conditions where each study measured HDL cholesterol. Table 6.1 summarizes each study. More details are provided in the Methods and Supplementary Information. We analyzed all 17 studies together and we also analyzed the 9 male and 8 female studies separately. Some significant associations are shared and some associations are specific to males or females.

The Manhattan plots in Figure 6.5 show the Meta-GxE result when applied to the 17 studies, the 9 male only studies and the 8 female only studies. Table 6.2 summarizes 26 significant peaks ($P < 1.0 \times 10^{-6}$), showing the P-values obtained by applying Meta-GxE to the male only studies (9 studies), the female only studies (8 studies) and the male + female studies (17 studies). For each significant locus, we computed m-values, interpreted as the posterior probability of having an effect on the phenotype, and report the number of studies with an effect (E), the number of studies with ambiguous effect size (A) and the number of studies without an effect (N). We also report the number of individual studies where the locus was significant ($P < 1.0 \times 10^{-6}$). As seen in the table, many of the loci were not significant in any of the individual studies and would not have been discovered without combining the studies. We note that we use a more stringent genome-wide threshold of $P < 1.0 \times 10^{-6}$ than was used in the original studies. The Genes in Region and Gene Refs columns contain the names of genes near the locus previously known to affect HDL cholesterol level or closely related lipid levels in the blood, and the associated literature citations.

Among the 26 loci that we identified by applying Meta-GxE, 24 loci are near genes (most of which are located within 1 Mb of the peak) known to affect HDL or closely related lipid levels in the blood, while 2 loci are novel.

For example, we identified 3 loci (Chr10:21399819, Chr19:3319089, ChrX:151384614) where female mice show a more significant effect on HDL phenotypes than male mice. We also identified 7 loci (Chr1:171199523, Chr8:46903188, Chr8:64150094, Chr8:84073148, Chr10:90146088, Chr11:69906552, Chr15:21194226) where male mice show a more significant effect on HDL than female mice. Among the 26 loci, many show significant heterogeneity in effect sizes between the 17 studies, which we interpret as gene-by-environment interactions.

One interesting example showing strong gene-by-environment interaction is the locus at Chr8:84073148. This locus is located near the gene Prkaca, which is known to be involved in abnormal lipid levels in blood [NWW05]. Figure 6.6 shows the forest plot and PM-plot for this locus. If we look at the forest plot of the locus in Figure 6.6, we can easily see that there are two groups: 12 studies with an effect (red dots) and 5 studies with an ambiguous prediction of the existence of an effect (green dots). Interestingly, the log odds ratios of effect size for the 12 studies with an effect are about the same (around 0.2). The common characteristic in 4 of the 5 studies (HMDPxB-chow(F), HMDPxB-ath(F), BXH-apoe(F), CXB-ldlr(F)) is that they are female mice with high LDL levels in the blood. In addition, in all 4 cases, these high LDL levels are caused by mutant genes. Mice in the HMDPxB-chow and HMDPxB-ath studies have transgenes for both Apoe Leiden [37] and human Cholesterol Ester Transfer Protein (CETP), while mice in the BXH-apoe and CXB-ldlr studies carried knockouts of the genes for Apoe and the LDL receptor, respectively. This is strong evidence that there is an interaction between sex and mutation-driven LDL levels through this locus (Chr8:84073148) when affecting HDL levels in mice.

Supplementary Figures S5 - S30 show the forest plots and PM-plots for each locus, which show information such as the effect sizes, the direction of the effect, and which studies do and do not have an effect among the 17 studies at the given locus.

6.3 Discussion

In this paper, we present a new meta-analysis approach for discovering gene-by-environment interactions that can be applied to a large number of heterogeneous studies, each conducted in different environments and with animals from different genetic backgrounds. We show the practical utility of the proposed method by applying it to 17 mouse HDL studies containing 4,965 mice, and we successfully identify many known loci involved in HDL. Consistent with the results of meta-analysis in human studies, our combined study finds many loci which were not discovered in any of the individual studies.

A point of emphasis is that in our study design, in each of the combined studies, all of the individuals in the study are subject to only a single environment. This is distinct from other approaches for discovery of gene-by-environment interactions using meta-analysis such as those described in [MLL11]. In these approaches, in each of the combined studies, the individuals in the study are subject to multiple environments and information on each individual's environment is collected. Gene-by-environment statistics are then computed in each study and then combined in the meta-analysis. In our study design, we compute main effect sizes for each SNP and then look for variants where the effect sizes are different, suggesting the presence of a gene-by-environment interaction.

In our meta-analysis approach, we assume that we do not have any prior knowledge of the effect size in any particular study. However, one might incorporate prior knowledge of the specific environmental effects. In some cases, one might know that some of the studies have similar effect sizes as compared to others, or the prior knowledge might suggest that one specific study needs to be eliminated from the meta-analysis. If we utilize such prior knowledge, we may be able to achieve even higher statistical power.

In this paper we have addressed how to perform meta-analysis when the studies have different genetic structures, building on the results of our previous study [FKV12]. While in this paper we combine 7 HMDP studies with 10 genetic crosses, the approach in principle can be used to combine any variety of study types. Recently, several strategies for mouse genome-wide association mapping have been proposed [FE12]. These include the HMDP [BFO10b], the collaborative cross [CAA04] and outbred stocks [YNB10, FE12]. The approach presented here can be utilized to combine these different kinds of studies and is a roadmap for integrating the results of different strategies for mouse GWAS.

Although we have focused on explaining heterogeneity by gene-by-environment interaction, it is possible that the differences in effect sizes are caused by gene-by-gene interactions on different genetic backgrounds, where the interacting variants differ in frequency in the different studies. While gene-by-gene interactions certainly contribute to locus heterogeneity, we predict that, in combining studies with similar genetic structures, locus heterogeneity more likely arises from gene-by-environment interactions. In any case, determining whether these heterogeneous loci are driven by the environment or by gene-by-gene interactions is an important and interesting direction for future study.

6.4 Methods and Materials

6.4.1 Standard study design for testing gene-by-environment interactions

In model organism studies for which we can control the environment, the standard study design for testing gene-by-environment interactions is to combine multiple cohorts whose environments are known. The environmental value that we vary is typically a quantitative measure that we can model with a single random variable. Thus, the standard univariate linear model can be applied,

$$y = \mu + \alpha D + \beta X + \gamma X \cdot D + e$$

where $y$ is the $n \times 1$ vector of phenotype measurements from $n$ individuals, $\mu$ is the phenotype mean, $\alpha$ is the main environmental effect, $D$ is the $n \times 1$ environmental status vector, $\beta$ is the genetic effect, $X$ is the $n \times 1$ genotype vector, $\gamma$ is the GxE interaction effect, $\cdot$ denotes the element-wise product of two vectors, and $e$ is the residual error, which follows a normal distribution. In this model, $D$ is a vector of indicators which describes the environmental status of each individual. For example, suppose the environmental condition of one study is wildtype and that of another is gene knockout. In this case, the environmental condition of wildtype is described as 0 and that of knockout is described as 1. In order to test if there are interactions, we test the null hypothesis $\gamma = 0$ versus the alternative hypothesis $\gamma \neq 0$. Another possible testing strategy is to test the interaction effect together with the genetic effect, that is, the null hypothesis $\beta = 0$ and $\gamma = 0$ versus the alternative hypothesis $\beta \neq 0$ or $\gamma \neq 0$. This strategy is powerful in detecting loci exhibiting both genetic effects and interaction effects.
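As an illustration of the univariate interaction model, the sketch below fits $\mu$, $\alpha$, $\beta$ and $\gamma$ by ordinary least squares using the design columns $(1, D, X, X \cdot D)$. It is a minimal sketch under stated assumptions: it returns point estimates only (a test of $\gamma = 0$ would additionally need standard errors), it ignores population structure, and the function names and toy data are hypothetical.

```python
def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [bi] for row, bi in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for r in range(n - 1, -1, -1):
        tail = sum(m[r][c] * x[c] for c in range(r + 1, n))
        x[r] = (m[r][n] - tail) / m[r][r]
    return x

def fit_interaction_model(y, genotype, env):
    """Least-squares estimates (mu, alpha, beta, gamma) of the univariate GxE model."""
    rows = [[1.0, d, x, x * d] for x, d in zip(genotype, env)]
    p = 4
    xtx = [[sum(r[i] * r[j] for r in rows) for j in range(p)] for i in range(p)]
    xty = [sum(r[i] * yi for r, yi in zip(rows, y)) for i in range(p)]
    return tuple(solve_linear_system(xtx, xty))

# Toy data: genotypes 0/1/2 in wildtype (D = 0) and knockout (D = 1) animals,
# generated without noise from mu=1.0, alpha=0.5, beta=0.2, gamma=0.3.
genotype = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]
env = [0.0, 0.0, 0.0, 1.0, 1.0, 1.0]
y = [1.0 + 0.5 * d + 0.2 * x + 0.3 * x * d for x, d in zip(genotype, env)]
print(fit_interaction_model(y, genotype, env))
```

The product column `x * d` is the element-wise term $X \cdot D$; its coefficient is the interaction effect $\gamma$.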

6.4.2 Multivariate interactions model

For more complicated scenarios where the different environments cannot be modeled with a single variable, a straightforward extension of the standard univariate interactions model is the multivariate model. Suppose that there are $k$ different possible environments and the information on the environment of each individual is captured by a matrix $D$ which has $k$ columns, where each column corresponds to one environment. Then, the standard multivariate interactions model will be

$$y = \sum_{i=1}^{k} \alpha_i D_i + \beta X + \sum_{i=1}^{k} \gamma_i X \cdot D_i + e \tag{6.3}$$

$D_i$ is the $i$th column of the $D$ matrix, $\alpha_i$ is the environment-specific mean, $y$ denotes the phenotype measurements, $X$ denotes the genotypes, $\beta$ denotes the fixed genetic effect, $\gamma_i$ denotes the GxE interaction effect of the $i$th environmental variable, and $e$ denotes the residual error. Then the testing will be between the null hypothesis $\gamma_1 = 0$ and ... and $\gamma_k = 0$ versus the alternative hypothesis $\gamma_1 \neq 0$ or ... or $\gamma_k \neq 0$. The test statistic will be

$$S_{Mult} = \sum_{i=1}^{k} Z_i^2$$

where $Z_i$ is the z-score corresponding to $\gamma_i$. $S_{Mult}$ follows $\chi^2_{(k)}$ under the null. Similarly to the univariate model, if we want to test the interaction effect together with the genetic effect, we add the z-score corresponding to $\beta$ into the statistic, in which case the statistic will follow $\chi^2_{(k+1)}$.
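The statistic $S_{Mult}$ and its chi-square P-value can be sketched as follows, assuming the z-scores for the $\gamma_i$ have already been estimated and can be treated as independent under the null. The helper `chi2_sf` is a small pure-Python chi-square upper tail written for this illustration; the function names and the example z-scores are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Chi-square upper tail, via the recurrence over degrees of freedom."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        sf, k = math.erfc(math.sqrt(half)), 1
    else:
        sf, k = math.exp(-half), 2
    while k < df:
        sf += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return sf

def s_mult(z_scores):
    """Sum of squared z-scores and its P-value on len(z_scores) degrees of freedom."""
    stat = sum(z * z for z in z_scores)
    return stat, chi2_sf(stat, len(z_scores))

# Hypothetical interaction z-scores for k = 3 environmental variables.
stat, p_value = s_mult([2.1, -0.4, 3.0])
print(stat, p_value)
```

To test the interaction effects jointly with the genetic effect, the z-score for $\beta$ would be appended to the list, which raises the degrees of freedom to $k + 1$.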

6.4.3 Standard meta-analysis approach

Before we describe the relationship between gene-by-environment interactions and meta-analysis, we first describe the standard fixed effects and random effects meta-analysis in detail.

6.4.3.1 Fixed effects model meta analysis

In standard meta-analysis, we have $N$ studies. In each of the $N$ studies, we estimate the effect size of interest. Suppose that we estimate the genetic effect in study $i$,

$$y_i = \alpha_i + \delta_i X_i + e_i \tag{6.4}$$

We can obtain the estimate of $\delta_i$ and its variance $V_i$. In the fixed effects model meta-analysis, we assume that the underlying effect sizes all equal a common value $\delta$ ($\delta = \delta_1 = \dots = \delta_N$). The best estimate of $\delta$ is the inverse-variance weighted effect size,

$$\bar{\delta} = \frac{\sum_i W_i \delta_i}{\sum_i W_i}, \tag{6.5}$$

where $W_i = 1/V_i$ is the so-called inverse variance. Then we test the null hypothesis $\delta = 0$ versus the alternative hypothesis $\delta \neq 0$.
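The inverse-variance weighted estimate and the corresponding z-test can be sketched as follows. This is a minimal illustration that uses the standard result that the pooled estimate has variance $1/\sum_i W_i$; it does not include the population structure correction of [FKV12], and the function name and example values are hypothetical.

```python
import math

def fixed_effects_meta(effects, variances):
    """Return (pooled effect, standard error, z-score, two-sided P-value)."""
    weights = [1.0 / v for v in variances]          # W_i = 1 / V_i
    total = sum(weights)
    pooled = sum(w * d for w, d in zip(weights, effects)) / total
    se = math.sqrt(1.0 / total)                     # variance of pooled estimate is 1 / sum(W_i)
    z = pooled / se
    p_value = math.erfc(abs(z) / math.sqrt(2.0))    # two-sided normal tail
    return pooled, se, z, p_value

# Hypothetical effect estimates and variances for three studies.
print(fixed_effects_meta([0.20, 0.25, 0.15], [0.010, 0.020, 0.015]))
```

Studies with smaller variance, typically the larger studies, dominate the pooled estimate.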

6.4.3.2 Testing heterogeneity

The phenomenon that the underlying effect sizes differ between studies is called heterogeneity. The presence of heterogeneity is tested using Cochran's Q test [DL86a, Coc09]. Cochran's Q test is a non-parametric test for testing if $N$ studies have the same effect or not. In particular, it tests the null hypothesis $\delta_1 = \dots = \delta_N$ versus the alternative hypothesis NOT ($\delta_1 = \dots = \delta_N$). Cochran's Q statistic can be calculated as the weighted sum of squared differences between individual study effects and the pooled effect across studies,

$$Q = \sum_{i=1}^{N} W_i (\delta_i - \bar{\delta})^2 \tag{6.6}$$

Under the null hypothesis, Cochran's Q statistic follows a chi-square distribution with $N - 1$ degrees of freedom.
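Cochran's Q and its P-value can be computed directly from the per-study estimates, as in the sketch below. The helper `chi2_sf` is a small pure-Python chi-square upper tail written for this illustration; the function names and the example values are hypothetical.

```python
import math

def chi2_sf(x, df):
    """Chi-square upper tail, via the recurrence over degrees of freedom."""
    if x <= 0.0:
        return 1.0
    half = x / 2.0
    if df % 2 == 1:
        sf, k = math.erfc(math.sqrt(half)), 1
    else:
        sf, k = math.exp(-half), 2
    while k < df:
        sf += math.exp((k / 2.0) * math.log(half) - half - math.lgamma(k / 2.0 + 1.0))
        k += 2
    return sf

def cochran_q_test(effects, variances):
    """Return (Q, P-value) with N - 1 degrees of freedom."""
    weights = [1.0 / v for v in variances]
    pooled = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
    q = sum(w * (d - pooled) ** 2 for w, d in zip(weights, effects))
    return q, chi2_sf(q, len(effects) - 1)

# Hypothetical effect estimates and variances for four studies.
print(cochran_q_test([0.10, 0.35, 0.05, 0.40], [0.01, 0.01, 0.02, 0.02]))
```

A small P-value indicates that the effect sizes differ between studies by more than sampling error alone would explain.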

6.4.3.3 Random effects model meta analysis

Under the random effects model meta-analysis, we explicitly model heterogeneity by assuming a hierarchical model. We assume that the effect size of each study $\delta_i$ is a random variable sampled from a distribution with mean $\mu$ and variance $\tau^2$,

$$\delta_i \sim N(\mu, \tau^2)$$

Traditional formulations of a random effects meta-analysis method are known to be overly conservative [DL86a, IPE07a, IPE07b]. However, we recently developed a random effects model that addresses this issue [HE11b]. The method assumes that there is no heterogeneity under the null, a modification that is natural in the context of association studies because the effect size should be fixed to be zero under the null hypothesis. This random effects model tests the null hypothesis $\mu = 0$ and $\tau^2 = 0$ versus the alternative hypothesis $\mu \neq 0$ or $\tau^2 \neq 0$.

Similarly to the traditional random effects model [DL86a], we use the likelihood ratio framework, considering each statistic as a single observation. Since we assume no heterogeneity under the null, $\mu = 0$ and $\tau^2 = 0$ under the null hypothesis. The likelihoods are then

$$L_0 = \prod_i \frac{1}{\sqrt{2\pi V_i}} \exp\left(-\frac{\delta_i^2}{2V_i}\right)$$

$$L_1 = \prod_i \frac{1}{\sqrt{2\pi (V_i + \tau^2)}} \exp\left(-\frac{(\delta_i - \mu)^2}{2(V_i + \tau^2)}\right).$$

The maximum likelihood estimates $\hat{\mu}$ and $\hat{\tau}^2$ can be found by an iterative procedure suggested by Hardy and Thompson [HT96]. Then the likelihood ratio test statistic can be built,

$$S_{meta} = -2\log(\lambda) = \sum_i \log\left(\frac{V_i}{V_i + \hat{\tau}^2}\right) + \sum_i \frac{\delta_i^2}{V_i} - \sum_i \frac{(\delta_i - \hat{\mu})^2}{V_i + \hat{\tau}^2}, \tag{6.7}$$

whose P-value is calculated using tabulated values [HE11b].
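The statistic in equation (6.7) can be sketched numerically as below. The sketch finds $\hat{\mu}$ and $\hat{\tau}^2$ with a fixed-point iteration derived from the score equations of $L_1$, with $\hat{\tau}^2$ truncated at zero; this is an assumption about the form of the iterative procedure, not a transcription of [HT96]. It returns only the statistic, since the P-value relies on the tabulated values of [HE11b]. The function name and example inputs are hypothetical.

```python
import math

def re_lrt_statistic(effects, variances, max_iter=1000, tol=1e-12):
    """Return (mu_hat, tau2_hat, S_meta) for the random effects likelihood ratio test."""
    mu, tau2 = 0.0, 0.0
    for _ in range(max_iter):
        weights = [1.0 / (v + tau2) for v in variances]
        # Score equation for mu: weighted mean with weights 1 / (V_i + tau^2).
        mu_new = sum(w * d for w, d in zip(weights, effects)) / sum(weights)
        # Score equation for tau^2, rearranged as a fixed point and truncated at 0.
        num = sum(w * w * ((d - mu_new) ** 2 - v)
                  for w, d, v in zip(weights, effects, variances))
        tau2_new = max(0.0, num / sum(w * w for w in weights))
        converged = abs(mu_new - mu) < tol and abs(tau2_new - tau2) < tol
        mu, tau2 = mu_new, tau2_new
        if converged:
            break
    # Equation (6.7): -2 log likelihood ratio of the null against the alternative.
    stat = (sum(math.log(v / (v + tau2)) for v in variances)
            + sum(d * d / v for d, v in zip(effects, variances))
            - sum((d - mu) ** 2 / (v + tau2) for d, v in zip(effects, variances)))
    return mu, tau2, stat

# Hypothetical effect estimates and variances for four studies.
print(re_lrt_statistic([0.10, 0.35, 0.05, 0.40], [0.01, 0.01, 0.02, 0.02]))
```

When the estimated heterogeneity is zero, the first sum vanishes and the statistic reduces to the difference of the weighted sums of squares around 0 and around the pooled mean.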

6.4.4 Relation between gene-by-environment interactions and meta-analysis

Here we explain the relationship between gene-by-environment interactions and meta-analysis in more detail, building on the explanation in the Results section. If we do not consider interactions, it is already known that the fixed effects model meta-analysis is approximately equivalent to the linear model of combined cohorts [LS09]. That is, the fixed effects model equation (8.3) gives approximately equivalent results to the combined linear model

$$y = \sum_{i=1}^{k} \alpha_i A_i + \beta X + e \tag{6.8}$$

where X is the combined genotype vector from all cohorts, A is a matrix of indicator columns identifying which cohort each individual belongs to, $A_i$ is the $i$th column of A, and $\alpha_i$ is the cohort-specific mean. The two methods are approximately equivalent because they both test the fixed mean effect ($\beta$ in equation (6.8) and $\delta$ in equation (8.3)). The subtle difference between the two models is that in equation (6.8) we assume the error e follows a single normal distribution (e.g. $N(0, \sigma^2)$), whereas in equation (8.3) the variance of the distributions may differ between studies (e.g. $e_j \sim N(0, \sigma_j^2)$ for each j). In other words, under the constant error variance assumption ($\sigma_1^2 = \sigma_2^2 = \dots = \sigma_N^2$), the two models become equivalent and $\beta$ in equation (6.8) equals $\delta$ in equation (8.3), $\beta = \delta$.

Similarly, by considering interactions, we extend this argument to show the relationship between gene-by-environment interactions and meta-analysis. We consider the relationship between equation (6.3) and equation (6.4). For simplicity of notation, we consider the case where the matrix D is defined such that each individual is in only one environment, so that D is equivalent to the matrix A described above. Under the constant error variance assumption, we establish the following relationship,

$$\delta_i = \beta + \gamma_i$$

where the left-hand side is the coefficient of the genotype $X_i$ of study i in the meta-analysis equation (6.4) and the right-hand side is the coefficient of the same $X_i$ (study i's part of the combined genotype matrix X) in equation (6.3).

Suppose that there are no interactions (the null hypothesis of interaction testing). Then $\gamma_i = 0$ for each study i. Thus, the meta-analysis effect size $\delta_i$ is equal to $\beta$, the genetic effect that is invariant across studies. Therefore, $\tau^2 = 0$ (the null hypothesis of heterogeneity testing). On the other hand, suppose that $\tau^2 = 0$ (the null hypothesis of heterogeneity testing). Naturally, $\gamma_i = 0$ for all studies (the null hypothesis of interaction testing). This shows that the null hypothesis of the interaction test in the model (6.3) and the null hypothesis of the heterogeneity test in meta-analysis are equivalent. As a result, we can utilize meta-analytic heterogeneity testing to detect interactions.

Using similar reasoning, it is straightforward to show that we can utilize the random effects model meta-analysis method to detect the mean effect and the interaction effect at the same time, which can be powerful for identifying loci bearing both kinds of effects.
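The relationship $\delta_i = \beta + \gamma_i$ can be illustrated with a small noise-free example. The sketch below builds one phenotype per study from a shared effect and a study-specific interaction effect, fits an ordinary least-squares slope within each study, and recovers $\beta + \gamma_i$ as the per-study effect. All numeric values and the helper name `ols_slope` are hypothetical; real data would include error terms and, in this chapter, a population structure correction.

```python
def ols_slope(x, y):
    """Least-squares slope of y on x for a model with an intercept."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    return sxy / sxx


beta = 0.5                   # shared genetic effect (hypothetical value)
gammas = [0.0, 0.2, -0.3]    # study-specific interaction effects (hypothetical)
alphas = [1.0, 2.0, 3.0]     # study-specific means (hypothetical)
genotypes = [0.0, 1.0, 2.0, 0.0, 1.0, 2.0]

deltas = []
for alpha, gamma in zip(alphas, gammas):
    # Noise-free phenotype for one study: mean + (beta + gamma_i) * genotype.
    phenotype = [alpha + (beta + gamma) * g for g in genotypes]
    deltas.append(ols_slope(genotypes, phenotype))

print(deltas)  # per-study effects, each equal to beta + gamma_i up to rounding
```

With all interaction effects set to zero, every per-study slope equals the shared effect, which is the situation described by $\tau^2 = 0$.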

6.4.5 Controlling for population structure within studies

Model organisms such as the mouse are well known to exhibit population structure or cryptic relatedness [DRB01, VP05], where genetic similarities between individuals both inhibit the ability to find true associations and cause the appearance of a large number of false or spurious associations. Mixed effects models are often used to correct for this problem [Lan02, YPB06, KZW08, KSS10, LQK13, LLK13]. Methods employing a mixed effects correction account for the genetic similarity between individuals by introducing a random variable into the traditional linear model.

$$y_i = \mu + \delta_i X + u_i + \epsilon \tag{6.9}$$

In the model in equation (6.9), the random variable $u_i$ represents the vector of genetic contributions to the phenotype for individuals in population i. This random variable is assumed to follow a normal distribution, $u_i \sim N(0, \sigma_g^2 K_i)$, where $K_i$ is the $n_i \times n_i$ kinship coefficient matrix for population i. With this assumption, the total variance of $y_i$ is given by $\Sigma_i = \sigma_g^2 K_i + \sigma_e^2 I$. A z-score statistic is derived for the test $\delta_i = 0$ by noting the distribution of the estimate $\hat{\delta}_i$ of $\delta_i$. To avoid complicated notation, we introduce a more basic matrix form of the model in equation (6.9), shown in equation (6.10).

$$y_i = S_i \Gamma + u_i + \epsilon \tag{6.10}$$

In equation (6.10), $S_i$ is an $n_i \times 2$ matrix whose first column is a vector of 1s representing the global mean and whose second column is the genotype vector, and $\Gamma$ is a $2 \times 1$ coefficient vector containing the mean $\alpha_i$ and the genotype effect $\delta_i$. We note that this form also easily extends to models with multiple covariates. The maximum likelihood estimate for $\Gamma$ in population i is given by $\hat{\Gamma}_i = (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} y_i$, which follows a normal distribution with mean equal to the true $\Gamma$ and variance $(S_i' \Sigma_i^{-1} S_i)^{-1}$. The estimate of the effect size $\delta_i$ and its standard error $SE(\delta_i)$ are then given in equation (6.11) and equation (6.12), where $R = [0\ 1]$ is a vector used to select the appropriate entry in the vector $\hat{\Gamma}_i$.

$$\delta_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} y_i \tag{6.11}$$

$$SE(\delta_i) = [R (S_i' \Sigma_i^{-1} S_i)^{-1} R']^{1/2} \tag{6.12}$$
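A minimal sketch of the estimates in equations (6.11) and (6.12) for a single study is given below, using only the Python standard library. It assumes that the variance components $\sigma_g^2$ and $\sigma_e^2$ have already been estimated elsewhere and are passed in as known values; the function names `solve` and `gls_effect` are hypothetical, and the dense Gauss-Jordan solver is chosen for readability rather than efficiency.

```python
import math


def solve(a, b):
    """Solve a @ x = b for square a and matrix b by Gauss-Jordan elimination."""
    n = len(a)
    m = [list(a[i]) + list(b[i]) for i in range(n)]   # augmented matrix [a | b]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        if abs(m[pivot][col]) < 1e-12:
            raise ValueError("matrix is singular")
        m[col], m[pivot] = m[pivot], m[col]
        p = m[col][col]
        m[col] = [v / p for v in m[col]]
        for r in range(n):
            if r != col:
                f = m[r][col]
                m[r] = [vr - f * vc for vr, vc in zip(m[r], m[col])]
    return [row[n:] for row in m]


def gls_effect(genotype, phenotype, kinship, sigma_g2, sigma_e2):
    """Return (delta_i, SE(delta_i)) as in equations (6.11) and (6.12)."""
    n = len(genotype)
    # Sigma_i = sigma_g^2 * K_i + sigma_e^2 * I
    sigma = [[sigma_g2 * kinship[r][c] + (sigma_e2 if r == c else 0.0)
              for c in range(n)] for r in range(n)]
    s = [[1.0, g] for g in genotype]                  # S_i: intercept, genotype
    rhs = [s[r] + [phenotype[r]] for r in range(n)]   # columns of S_i, then y_i
    z = solve(sigma, rhs)                             # Sigma^{-1} [S_i | y_i]
    # 2x2 matrix S' Sigma^{-1} S and 2x1 vector S' Sigma^{-1} y
    sts = [[sum(s[r][a] * z[r][b] for r in range(n)) for b in range(2)]
           for a in range(2)]
    sty = [[sum(s[r][a] * z[r][2] for r in range(n))] for a in range(2)]
    gamma_hat = solve(sts, sty)                       # estimate of Gamma
    cov = solve(sts, [[1.0, 0.0], [0.0, 1.0]])        # (S' Sigma^{-1} S)^{-1}
    # R = [0 1] selects the genotype entry of Gamma and of its covariance.
    return gamma_hat[1][0], math.sqrt(cov[1][1])


# Hypothetical example: unrelated individuals (identity kinship matrix).
identity = [[1.0 if r == c else 0.0 for c in range(4)] for r in range(4)]
print(gls_effect([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0], identity, 0.5, 0.5))
```

With an identity kinship matrix and unit total variance, the estimate reduces to the ordinary least-squares slope, which gives a quick sanity check of the matrix algebra.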

6.4.6 Meta-analysis of studies with population structure

When we test gene-by-environment interactions with meta-analysis approaches, one important step is correcting for population structure. This can be achieved by first correcting for population structure within each study, as described above. For example, consider the random effects model meta-analysis method that we primarily focus on. We apply population structure control using equations (6.11) and (6.12). Then the likelihood ratio test statistic will be

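$$S_{Pop} = -2\log(\lambda) = \sum_i \log\left(\frac{V_i}{V_i + \hat{\tau}^2}\right) + \sum_i \frac{\delta_i^2}{V_i} - \sum_i \frac{(\delta_i - \hat{\mu})^2}{V_i + \hat{\tau}^2}, \tag{6.13}$$

where $\delta_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} y_i$ and $V_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} R'$.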

6.4.7 Identifying studies with an effect

After identifying loci exhibiting interaction effects, we employ the meta-analysis interpretation framework that we recently developed. The m-value [HE12] is the posterior probability that the effect exists in each study. Suppose we have n studies that we want to combine. Let $E = [\delta_1, \delta_2, \dots, \delta_n]$ be the vector of estimated effect sizes and $V = [V_1, V_2, \dots, V_n]$ be the vector of estimated variances of the n effect sizes. We assume that the effect size $\delta_i$ follows a normal distribution.

$$P(\delta_i \mid \text{no effect}) = N(\delta_i; 0, V_i) \tag{6.14}$$

$$P(\delta_i \mid \text{effect}) = N(\delta_i; \mu, V_i) \tag{6.15}$$

We assume that the prior for the effect size is

$$\mu \sim N(0, \sigma^2) \tag{6.16}$$

A possible choice for σ in GWAS is 0.2 for small effect and 0.4 for large effect [SB09].

Let $C_i$ be a random variable whose value is 1 if study i has an effect and 0 otherwise, and let C be the vector of $C_i$ for the n studies. Since C has n binary values, C can take $2^n$ possible configurations. Let $U = [c_1, \dots, c_{2^n}]$ be a vector containing all of these possible configurations. We define the m-value $m_i$ as the probability $P(C_i = 1 \mid E)$, which is the probability of study i having an effect given the estimated effect sizes. We can compute this probability using Bayes' theorem in the following way.

$$m_i = P(C_i = 1 \mid E) = \frac{\sum_{c \in U_i} P(E \mid C = c) P(C = c)}{\sum_{c \in U} P(E \mid C = c) P(C = c)} \tag{6.17}$$

where $U_i$ is the subset of U whose elements' $i$th value is 1. Now we need to compute $P(E \mid C = c)$ and $P(C = c)$. $P(C = c)$ can be computed as

$$P(C = c) = \frac{B(|c| + \alpha, n - |c| + \beta)}{B(\alpha, \beta)} \tag{6.18}$$

where |c| denotes the number of 1's in c, B denotes the beta function, and we set $\alpha$ and $\beta$ to 1 [HE12]. The probability of E given configuration c, $P(E \mid C = c)$, can be computed as

$$P(E \mid C = c) = \int_{-\infty}^{\infty} \prod_{i \in c_0} N(\delta_i; 0, V_i) \prod_{i \in c_1} N(\delta_i; \mu, V_i)\, p(\mu)\, d\mu \tag{6.19}$$

$$= \bar{C}\, N(\bar{\delta}; 0, \bar{V} + \sigma^2) \prod_{i \in c_0} N(\delta_i; 0, V_i) \tag{6.20}$$

$$\bar{\delta} = \frac{\sum_i W_i \delta_i}{\sum_i W_i} \quad \text{and} \quad \bar{V} = \frac{1}{\sum_i W_i} \tag{6.21}$$

where $c_0$ is the set of indices of 0 in c, $c_1$ is the set of indices of 1 in c, and $N(\delta; a, b)$ denotes the probability density function of the normal distribution with mean a and variance b. $W_i = V_i^{-1}$ is the inverse variance, or precision, and $\bar{C}$ is a scaling factor.

$$\bar{C} = \frac{1}{(\sqrt{2\pi})^{N-1}} \sqrt{\frac{\prod_i W_i}{\sum_i W_i}} \exp\left\{ -\frac{1}{2} \left( \sum_i W_i \delta_i^2 - \frac{\left(\sum_i W_i \delta_i\right)^2}{\sum_i W_i} \right) \right\} \tag{6.22}$$

All summations and products that appear in computing $\bar{\delta}$, $\bar{V}$ and $\bar{C}$ are taken over $i \in c_1$.
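The m-value computation in equations (6.17) to (6.22) can be sketched by brute-force enumeration of all $2^n$ configurations, which is practical only for a modest number of studies. This is an illustrative sketch, not the software of [HE12]. Two assumptions are made and marked in the code: N in equation (6.22) is taken to be the number of studies in $c_1$ (the reading under which a configuration with a single study having an effect reduces to the marginal density $N(\delta_i; 0, V_i + \sigma^2)$), and a configuration with no study having an effect is assigned the product of null densities. Function names are hypothetical.

```python
import math
from itertools import product


def normal_pdf(x, mean, var):
    """Density of the normal distribution with the given mean and variance."""
    return math.exp(-0.5 * (x - mean) ** 2 / var) / math.sqrt(2.0 * math.pi * var)


def log_beta(a, b):
    """Logarithm of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def config_likelihood(effects, variances, config, sigma):
    """P(E | C = c) following equations (6.20) to (6.22)."""
    like = 1.0
    for d, v, c in zip(effects, variances, config):
        if c == 0:
            like *= normal_pdf(d, 0.0, v)             # studies without an effect
    ones = [(d, 1.0 / v) for d, v, c in zip(effects, variances, config) if c == 1]
    if not ones:
        return like                                   # assumption: empty c_1
    k = len(ones)                                     # assumption: N = |c_1|
    sw = sum(w for _, w in ones)
    swd = sum(w * d for d, w in ones)
    swd2 = sum(w * d * d for d, w in ones)
    d_bar, v_bar = swd / sw, 1.0 / sw                 # equation (6.21)
    c_bar = (math.sqrt(math.prod(w for _, w in ones) / sw)
             / math.sqrt(2.0 * math.pi) ** (k - 1)
             * math.exp(-0.5 * (swd2 - swd ** 2 / sw)))   # equation (6.22)
    return like * c_bar * normal_pdf(d_bar, 0.0, v_bar + sigma ** 2)


def m_values(effects, variances, sigma=0.2, alpha=1.0, beta=1.0):
    """Posterior probability that each study has an effect, equation (6.17)."""
    n = len(effects)
    numer, denom = [0.0] * n, 0.0
    for config in product((0, 1), repeat=n):
        ones = sum(config)
        prior = math.exp(log_beta(ones + alpha, n - ones + beta)
                         - log_beta(alpha, beta))     # equation (6.18)
        joint = prior * config_likelihood(effects, variances, config, sigma)
        denom += joint
        for i, c in enumerate(config):
            if c:
                numer[i] += joint
    return [x / denom for x in numer]


# Hypothetical example: two studies with a clear effect and one without.
print(m_values([0.5, 0.5, 0.0], [0.01, 0.01, 0.01], sigma=0.4))
```

For many studies or very small variances the densities can underflow, so a production implementation would likely work in log space; that refinement is omitted here to keep the correspondence with the equations visible.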

The m-values have the following interpretation: small m-values (< 0.1) represent a study that is predicted not to have an effect, large m-values (> 0.9) represent a study that is predicted to have an effect, and for values in between the prediction is ambiguous. It was previously reported that m-values can accurately distinguish studies having an effect from studies not having an effect [HE12]. For interpreting and understanding the result of the meta-analysis, it is informative to look at the P-value and m-value at the same time. We propose to apply the PM-plot framework [HE12], which plots the P-values and m-values of each study together in two dimensions. Figure 6.1 (b) shows one example of a PM-plot. In this example, studies with an m-value less than 0.1 are interpreted as studies not having an effect, while studies with an m-value greater than 0.9 are interpreted as studies having an effect. For studies with an m-value between 0.1 and 0.9, we cannot make a decision. One reason that studies are ambiguous (0.1 ≤ m-value ≤ 0.9) is that they are underpowered due to small sample size. If the sample size increases, the study can be drawn to either the left or the right side.
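The interpretation rule above can be written as a small helper that labels each study and tallies the labels. The thresholds 0.1 and 0.9 come from the text; the function names and the example m-values are hypothetical.

```python
def classify_m_values(m_values, low=0.1, high=0.9):
    """Label each study as 'effect', 'no effect' or 'ambiguous' by m-value."""
    labels = []
    for m in m_values:
        if m > high:
            labels.append("effect")
        elif m < low:
            labels.append("no effect")
        else:
            labels.append("ambiguous")
    return labels


def count_labels(labels):
    """Return counts as (effect, ambiguous, no effect)."""
    return (labels.count("effect"),
            labels.count("ambiguous"),
            labels.count("no effect"))


# Hypothetical m-values for five studies.
labels = classify_m_values([0.95, 0.50, 0.02, 0.91, 0.10])
print(labels, count_labels(labels))
```

A value exactly on a threshold, such as 0.10 in the example, is treated as ambiguous, matching the range 0.1 ≤ m-value ≤ 0.9.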

6.5 Tables

6.6 Figures

[Figure 6.1 image. Panel (A): forest plot of the log odds ratio for each of the 17 studies, with per-study P-values and an RE summary. Panel (B): PM-plot of −log10(p) against m-value, with studies marked as having an effect (m > .9), not having an effect (m < .1), or uncertain (.1 < m < .9). Panel title: Chr1 locus, Meta P = 4.41 × 10^−22, Gene: Apoa2.]

Figure 6.1: Application of Meta-GxE to Apoa2 locus. The forest plot (A) shows heterogeneity in the effect sizes across different studies. The PM-plot (B) predicts that 7 studies have an effect at this locus, even though only 1 study (HMDP-chow(M)) is genome-wide significant with P-value.

Figure 6.2: The prevalence of heterogeneity in effect size of significant loci. Each dot represents the association between a SNP and the HDL phenotype obtained by applying the random effects based meta-analysis approach. Dots with larger $I^2$ values represent more heterogeneity at the locus between studies. The distribution of the heterogeneity statistic for significant SNPs (red dots) in the meta analysis is skewed toward higher heterogeneity, while the non-significant SNPs are much less skewed.

[Figure 6.3 image: power (0.0 to 1.0) plotted against effect size (0.06 to 0.3); the remaining axis and legend labels are not recoverable.]

Figure 6.3: Power of mouse meta-analysis to identify gene-by-environment interactions in 4,965 animals from 17 studies under varying mean effect sizes and per-study variances of the effect size, the latter corresponding to gene-by-environment effects.

[Figure 6.4 image: four panels (a) to (d), labelled RE, FE and HE, with both axes running from 0.0 to 0.3 and a legend scale from 0.00 to 0.75; the axis labels are not recoverable.]

Figure 6.4: Power of (a) random-effect, (b) fixed-effect meta-analysis and (c) heterogeneity meta-analysis methods as a function of the effect size and the strength of the interaction effect (heterogeneity). (d) shows a comparison of the three methods with the color corresponding to the method with the highest power.

[Figure 6.5 image: three Manhattan plot panels (a), (b) and (c).]

Figure 6.5: Manhattan plots showing the results of Meta-GxE applied to (a) 17 HDL studies, (b) 9 HDL studies consisting only of male animals and (c) 8 studies consisting only of female animals.

[Figure 6.6 image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies. Panel title: Chr8 locus, Meta P = 4.94 × 10^−11, Gene: Prkaca.]

Figure 6.6: Peak SNP in chromosome 8 shows interesting gene-by-environment interactions between sex × mutation-driven LDL levels.

6.7 Supplementary Legends

Table 6.1: 17 HDL studies for meta analysis

| Study ID | Strains | Conditions | Age | Sex | # Strains | # Samples | # Sig Loci | Ref |
|---|---|---|---|---|---|---|---|---|
| HMDPxB-chow(M) | (HMDP x BL/6) F1 | Leiden/CETP TG, chow diet | 8 weeks | M | 97 | 516 | 1 | U |
| HMDPxB-chow(F) | (HMDP x BL/6) F1 | Leiden/CETP TG, chow diet | 8 weeks | F | 95 | 468 | 0 | U |
| HMDPxB-ath(M) | (HMDP x BL/6) F1 | Leiden/CETP TG, highfat diet | 24 weeks | M | 97 | 408 | 0 | U |
| HMDPxB-ath(F) | (HMDP x BL/6) F1 | Leiden/CETP TG, highfat diet | 24 weeks | F | 93 | 457 | 3 | U |
| HMDP-chow(M) | HMDP | chow diet | 12 weeks | M | 111 | 749 | 6 | [BFO10b] |
| HMDP-fat(M) | HMDP | highfat diet | 16 weeks | M | 106 | 586 | 0 | [PNO13] |
| HMDP-fat(F) | HMDP | highfat diet | 16 weeks | F | 92 | 475 | 0 | [PNO13] |
| BxD-db-12(M) | (DBA x BL/6) F2 | BXD db/db, chow diet | 12 weeks | M | 125 | 125 | 0 | [DNC12] |
| BxD-db-12(F) | (DBA x BL/6) F2 | BXD db/db, chow diet | 12 weeks | F | 122 | 122 | 0 | [DNC12] |
| BxD-db-5(M) | (DBA x BL/6) F2 | BXD db/db, chow diet | 5 weeks | M | 109 | 109 | 1 | [DNC12] |
| BxD-db-5(F) | (DBA x BL/6) F2 | BXD db/db, chow diet | 5 weeks | F | 139 | 139 | 0 | [DNC12] |
| BxH-apoe(M) | (C3H x BL/6) F2 | BXH Apoe -/- | 24 weeks | M | 161 | 161 | 0 | [WYS06] |
| BxH-apoe(F) | (C3H x BL/6) F2 | BXH Apoe -/- | 24 weeks | F | 174 | 174 | 0 | [WYS06] |
| BxH-wt(M) | (C3H x BL/6) F2 | BXH wildtype, highfat diet | 20 weeks | M | 164 | 164 | 0 | [NIS10] |
| BxH-wt(F) | (C3H x BL/6) F2 | BXH wildtype, highfat diet | 20 weeks | F | 144 | 144 | 0 | [NIS10] |
| CxB-ldlr(M) | (BALB/cJ x BL/6) F2 | CXB LDLR -/-, highfat diet | 12 weeks | M | 124 | 124 | 0 | U |
| CxB-ldlr(F) | (BALB/cJ x BL/6) F2 | CXB LDLR -/-, highfat diet | 12 weeks | F | 64 | 64 | 0 | U |

Seventeen HDL studies are combined in the meta analysis. U in the Ref column represents a data set that is not yet published. Mice for the HMDPxB panel were created by breeding females of the various HMDP inbred strains to males carrying transgenes for both Apoe Leiden and for human Cholesterol Ester Transfer Protein (CETP) on a C57BL/6 genetic background. The Leiden/CETP transgenes [MHK93, JAW92] cause high total cholesterol level and high LDL cholesterol level in the circulation, along with reduced HDL cholesterol. BxD db/db denotes a population of F2 mice from a cross between C57BL/6 and DBA/2 with homozygous deficiency in leptin receptor (db/db), which results in obese mice. BxH Apoe -/- denotes a population of F2 mice from a cross between C57BL/6 and C3H also carrying a deficiency in apolipoprotein E. CxB LDLR -/- denotes a population of F2 mice from a cross between C57BL/6 and BALB/cBy also carrying a deficiency in LDL receptor, which results in high LDL cholesterol level in the circulation. BXH wildtype denotes a population of F2 mice from a cross between C57BL/6 and C3H.

Table 6.2: 26 significant loci identified by applying Meta-GxE analysis

| SNP Location | Meta GxE P (Male) | Meta GxE P (Female) | Meta GxE P (Male+Female) | # of Studies w/ Significant P | HE Meta P (Male+Female) | # Studies E/A/N | Genes in Region | Gene Refs |
|---|---|---|---|---|---|---|---|---|
| Chr1:64752822 (rs31078051) | 1.12×10^−6 | 2.37×10^−3 | 9.82×10^−8 | 0 | 3.82×10^−3 | 2/14/1 | Pikfyve | [THS13] |
| Chr1:107271282 (rs32203839) | 6.69×10^−4 | 2.66×10^−4 | 7.67×10^−7 | 0 | 7.55×10^−2 | 6/0/11 | Bcl2 | [HBM12] |
| Chr1:171199523 (rs32075748) | 3.45×10^−16 | 1.39×10^−7 | 4.41×10^−22 | 1 | 5.80×10^−5 | 8/6/3 | Apoa2 | [PAO97] |
| Chr2:77837584 (rs6273567) | 1.55×10^−4 | 7.17×10^−4 | 5.74×10^−7 | 0 | 2.25×10^−2 | 3/14/0 | Agps | [LCD11] |
| Chr2:134421733 (rs27238693) | 7.69×10^−6 | 1.66×10^−3 | 1.09×10^−7 | 0 | 1.63×10^−2 | 5/12/0 | Jag1 | [HZK10] |
| Chr3:32944259 (rs29869794) | 2.97×10^−6 | 2.49×10^−2 | 3.68×10^−7 | 0 | 2.36×10^−5 | 3/8/6 | Prkci | [FSY07] |
| Chr3:76066632 (rs31487078) | 5.89×10^−3 | 2.29×10^−5 | 5.03×10^−7 | 0 | 6.56×10^−2 | 4/13/0 | Novel | - |
| Chr3:107430396 (rs30013147) | 9.59×10^−6 | 3.84×10^−5 | 1.56×10^−9 | 0 | 7.74×10^−2 | 7/10/0 | Csf1 | [QTM97] |
| Chr3:143466942 (rs30206761) | 1.82×10^−3 | 3.97×10^−5 | 3.34×10^−7 | 0 | 8.60×10^−2 | 7/10/0 | Hs2st1 | [SWC10] |
| Chr4:131925523 (rs32595861) | 1.72×10^−4 | 2.84×10^−4 | 1.41×10^−7 | 0 | 8.65×10^−4 | 6/8/3 | Fabp3 | [MBB10] |
| Chr5:119034507 (rs33131194) | 1.23×10^−4 | 2.59×10^−3 | 9.00×10^−7 | 0 | 3.94×10^−1 | 9/8/0 | Nos1 | [LGP12, NTS08] |
| Chr8:46903188 (rs33272858) | 1.47×10^−7 | 6.52×10^−1 | 1.66×10^−6 | 1 | 1.60×10^−4 | 2/11/4 | Acsl1 | [LEP09] |
| Chr8:64150094 (rs31750594) | 1.96×10^−7 | 1.89×10^−4 | 1.33×10^−10 | 0 | 8.34×10^−1 | 11/6/0 | Cpe | [NNV94] |
| Chr8:84073148 (rs33435859) | 1.95×10^−8 | 4.53×10^−4 | 4.94×10^−11 | 0 | 8.33×10^−1 | 12/5/0 | Prkaca | [NWW05] |
| Chr9:101972687 (rs6333310) | 1.22×10^−4 | 1.22×10^−5 | 4.05×10^−9 | 0 | 1.98×10^−8 | 2/1/14 | Pik3cb | [CIM08] |
| Chr10:21399819 (rs29363941) | 9.07×10^−4 | 3.64×10^−7 | 3.36×10^−9 | 0 | 1.18×10^−2 | 3/12/2 | Ifngr1 | [GPJ97] |
| Chr10:90146088 (rs29370592) | 1.93×10^−7 | 0.756 | 1.02×10^−5 | 1 | 8.94×10^−4 | 2/14/1 | Nr1h4 | [KAI07] |
| Chr11:69906552 (rs29477071) | 5.77×10^−8 | 1.35×10^−5 | 3.17×10^−12 | 0 | 2.37×10^−9 | 6/9/2 | Plscr3 | [WZL04] |
| Chr11:114083173 (rs29416888) | 0.00011 | 7.83×10^−5 | 1.71×10^−7 | 0 | 5.28×10^−5 | 3/13/1 | Acox1 | [FPC96] |
| Chr14:33632464 (rs31061259) | 1.96×10^−4 | 1.65×10^−3 | 8.90×10^−7 | 0 | 2.02×10^−5 | 3/10/4 | Ppyr1 | [SBS03] |
| Chr15:21194226 (rs31670969) | 1.96×10^−8 | 1.29×10^−2 | 8.97×10^−10 | 1 | 5.65×10^−7 | 3/2/12 | Novel | - |
| Chr15:59860191 (rs3718217) | 5.64×10^−6 | 1.45×10^−5 | 5.31×10^−10 | 0 | 9.92×10^−5 | 5/10/2 | Trib1, Sqle | [EBS11, FRH13] |
| Chr17:46530712 (rs33259313) | 1.09×10^−5 | 0.0049 | 3.26×10^−7 | 0 | 3.53×10^−5 | 5/10/2 | Gnmt | [LLC07] |
| Chr18:82240606 (rs13483466) | 2.05×10^−4 | 2.23×10^−4 | 1.30×10^−7 | 0 | 9.05×10^−1 | 5/12/0 | Mbp | [BCG89] |
| Chr19:3319089 (rs31004232) | 5.58×10^−2 | 8.53×10^−7 | 4.56×10^−7 | 0 | 1.08×10^−1 | 3/14/0 | Lrp5 | [FAK03] |
| ChrX:151384614 (rs31202008) | 2.59×10^−4 | 4.72×10^−6 | 8.09×10^−9 | 0 | 1.22×10^−1 | 5/5/0 | Htr2c | [KGT08] |

Twenty-six significant loci identified by applying Meta-GxE analysis, comprising both random effects meta-analysis and heterogeneity testing, to 17 mouse HDL studies under different environments containing 4,965 total animals. # studies E denotes the number of studies with an effect on HDL phenotype. # studies N denotes the number of studies with no effect on HDL phenotype. # studies A denotes the number of studies with an ambiguous effect size. Genes in region denotes candidate genes for each locus based on close proximity to the peak SNP and previously suggested role in lipid or apolipoprotein metabolism: Pikfyve (phosphoinositide kinase), Bcl2 (B cell leukemia/lymphoma 2), Apoa2 (apolipoprotein A-II), Agps (alkylglycerone phosphate synthase), Jag1 (jagged 1), Prkci (protein kinase C), Csf1 (colony stimulating factor 1 (macrophage)), Hs2st1 (heparan sulfate 2-O-sulfotransferase 1), Fabp3 (fatty acid binding protein 3), Nos1 (nitric oxide synthase 1), Acsl1 (acyl-CoA synthetase long-chain family member 1), Cpe (carboxypeptidase E), Prkaca (protein kinase, cAMP dependent, catalytic, alpha), Acox1 (peroxisomal acyl-coenzyme A oxidase 1), Ppyr1 (pancreatic polypeptide receptor 1), Trib1 (tribbles homolog 1), Sqle (squalene epoxidase), Gnmt (glycine N-methyltransferase), Mbp (myelin basic protein), Lrp5 (low density lipoprotein receptor-related protein 5), Htr2c (5-hydroxytryptamine (serotonin) receptor 2C).

Table 6.3: False Positive Rate of RE versus Traditional Wald Test Based Approaches at Thresholds of Increasing Significance

| Threshold α | RE, 2 knockout studies | RE, 3 | RE, 4 | RE, 5 | Wald test, 2 knockout studies | Wald test, 3 | Wald test, 4 | Wald test, 5 |
|---|---|---|---|---|---|---|---|---|
| 5.0 × 10^−2 | 5.00 × 10^−2 | 4.98 × 10^−2 | 4.99 × 10^−2 | 4.99 × 10^−2 | 5.07 × 10^−2 | 5.05 × 10^−2 | 5.06 × 10^−2 | 5.05 × 10^−2 |
| 5.0 × 10^−3 | 4.96 × 10^−3 | 5.00 × 10^−3 | 4.98 × 10^−3 | 4.94 × 10^−3 | 5.16 × 10^−3 | 5.18 × 10^−3 | 5.17 × 10^−3 | 5.12 × 10^−3 |
| 5.0 × 10^−4 | 5.03 × 10^−4 | 4.99 × 10^−4 | 5.04 × 10^−4 | 4.99 × 10^−4 | 5.27 × 10^−4 | 5.26 × 10^−4 | 5.20 × 10^−4 | 5.19 × 10^−4 |
| 5.0 × 10^−5 | 4.93 × 10^−5 | 4.93 × 10^−5 | 4.89 × 10^−5 | 5.14 × 10^−5 | 5.37 × 10^−5 | 5.36 × 10^−5 | 5.21 × 10^−5 | 5.31 × 10^−5 |

False positive rate is measured as we increase the number of knockout studies having interaction effects with 100 million null panel simulations.

Table 6.4: False Positive Rate of HE versus Traditional Wald Test Based Approaches at Thresholds of Increasing Significance

| Threshold α | HE, 2 knockout studies | HE, 3 | HE, 4 | HE, 5 | Wald test, 2 knockout studies | Wald test, 3 | Wald test, 4 | Wald test, 5 |
|---|---|---|---|---|---|---|---|---|
| 5.0 × 10^−2 | 5.01 × 10^−2 | 4.99 × 10^−2 | 4.98 × 10^−2 | 4.98 × 10^−2 | 5.06 × 10^−2 | 5.03 × 10^−2 | 5.05 × 10^−2 | 5.03 × 10^−2 |
| 5.0 × 10^−3 | 4.98 × 10^−3 | 5.01 × 10^−3 | 4.97 × 10^−3 | 4.97 × 10^−3 | 5.17 × 10^−3 | 5.15 × 10^−3 | 5.11 × 10^−3 | 5.12 × 10^−3 |
| 5.0 × 10^−4 | 4.99 × 10^−4 | 4.98 × 10^−4 | 4.98 × 10^−4 | 5.15 × 10^−4 | 5.21 × 10^−4 | 5.24 × 10^−4 | 5.30 × 10^−4 | 5.19 × 10^−4 |
| 5.0 × 10^−5 | 4.91 × 10^−5 | 4.95 × 10^−5 | 4.86 × 10^−5 | 4.99 × 10^−5 | 5.35 × 10^−5 | 5.32 × 10^−5 | 5.27 × 10^−5 | 5.15 × 10^−5 |

False positive rate is measured as we increase the number of knockout studies having interaction effects with 100 million null panel simulations.

[Figure 6.7 image: power (0.0 to 1.0) against the number of studies having interaction effect, for RE, HE and FE.]

Figure 6.7: Power comparison between random-effect, fixed-effect meta-analysis and heterogeneity testing

Figure 6.8: The association result of the fixed effects meta analysis.

[Figure 6.9 image: power (0.0 to 1.0) against the number of studies having interaction effect, for RE and the traditional approach (Trad).]

Figure 6.9: Power comparison between random-effect meta-analysis and traditional Wald test based approach

[Figure 6.10 image: power (0.0 to 1.0) against the number of studies having interaction effect, for HE and the traditional approach (Trad).]

Figure 6.10: Power comparison between heterogeneity testing approach and traditional Wald test based approach

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 9.82 × 10^−8; Gene: Pikfyve.]

Figure 6.11: Forest plot and PM-plot for Chr1:64752822 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 7.67 × 10^−7; Gene: Bcl2.]

Figure 6.12: Forest plot and PM-plot for Chr1:107271282 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 4.41 × 10^−22; Gene: Apoa2.]

Figure 6.13: Forest plot and PM-plot for Chr1:171199523 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 5.74 × 10^−7; Gene: Agps.]

Figure 6.14: Forest plot and PM-plot for Chr2:77837584 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 1.09 × 10^−7; Gene: Jag1.]

Figure 6.15: Forest plot and PM-plot for Chr2:134421733 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 3.68 × 10^−7; Gene: Prkci.]

Figure 6.16: Forest plot and PM-plot for Chr3:32944259 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 5.03 × 10^−7.]

Figure 6.17: Forest plot and PM-plot for Chr3:76066632 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 1.56 × 10^−9; Gene: Csf1.]

Figure 6.18: Forest plot and PM-plot for Chr3:107430396 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 3.34 × 10^−7; Gene: Hs2st1.]

Figure 6.19: Forest plot and PM-plot for Chr3:143466942 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 1.41 × 10^−7; Gene: Fabp3.]

Figure 6.20: Forest plot and PM-plot for Chr4:131925523 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 9.00 × 10^−7; Gene: Nos1.]

Figure 6.21: Forest plot and PM-plot for Chr5:119034507 locus

[Figure image: forest plot (log odds ratio, with per-study P-values) and PM-plot (−log10(p) against m-value) for the 17 studies; Meta P = 1.66 × 10^−6; Gene: Acsl1.]

Figure 6.22: Forest plot and PM-plot for Chr8:46903188 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 1.33 × 10^−10; Gene: Cpe]

Figure 6.23: Forest plot and PM-plot for Chr8:64150094 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 4.94 × 10^−11; Gene: Prkaca]

Figure 6.24: Forest plot and PM-plot for Chr8:84073148 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 4.05 × 10^−9; Gene: Pik3cb]

Figure 6.25: Forest plot and PM-plot for Chr9:101972687 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 3.36 × 10^−9; Gene: Ifngr1]

Figure 6.26: Forest plot and PM-plot for Chr10:21399819 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 1.02 × 10^−5; Gene: Nr1h4]

Figure 6.27: Forest plot and PM-plot for Chr10:90146088 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 3.15 × 10^−12; Gene: Plscr3]

Figure 6.28: Forest plot and PM-plot for Chr11:69906552 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 1.71 × 10^−7; Gene: Acox1]

Figure 6.29: Forest plot and PM-plot for Chr11:114083173 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 8.90 × 10^−7; Gene: Ppyr1]

Figure 6.30: Forest plot and PM-plot for Chr14:33632464 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 8.97 × 10^−10]

Figure 6.31: Forest plot and PM-plot for Chr15:21194226 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 5.31 × 10^−10]

Figure 6.32: Forest plot and PM-plot for Chr15:59860191 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 3.26 × 10^−7; Gene: Gnmt]

Figure 6.33: Forest plot and PM-plot for Chr17:46530712 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 1.30 × 10^−7; Gene: Mbp]

Figure 6.34: Forest plot and PM-plot for Chr18:82240606 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 4.56 × 10^−7; Gene: Lrp5]

Figure 6.35: Forest plot and PM-plot for Chr19:3319089 locus

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)); Meta P = 8.09 × 10^−9; Gene: Htr2c]

Figure 6.36: Forest plot and PM-plot for ChrX:151384614 locus

6.8 Supplementary Materials

6.8.1 Power comparison between Meta-GxE, fixed effects meta-analysis and Heterogeneity testing approach

We estimate power for the case in which the genetic effect exists in a subset of the studies. In our simulation study, assuming five studies with an equal sample size of 1,000, we estimate power when all five studies exhibit the effect, when only four studies exhibit the effect, and so on. Supplementary Figure S1 shows that as the number of studies with an effect decreases, the power of FE drops. In contrast, the RE method maintains high power regardless of the number of studies exhibiting an effect. The power of the HE method increases as heterogeneity due to gene-by-environment interactions increases. Note that the RE method is more powerful than the HE method when there are 2 studies having interaction effects. In this case, two studies have interaction effects and three studies do not have an effect. Since the magnitude and direction of effects are the same in the 2 studies with interaction effects, the regression for estimating effect sizes will behave as if there exists a common effect with effect size $\frac{2}{5}\beta$. For this reason, the RE method is more powerful than HE and FE in this case. We applied the FE analysis method to the 17 HDL studies. Supplementary Figure S2 shows the Manhattan plot of the result of the FE analysis. The FE method found fewer (14) significant loci compared to the 26 discovered by Meta-GxE.
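The common effect size of two fifths of β mentioned above can be checked with a short worked equation. This is only a sketch, assuming that the five studies receive equal weights (as their equal sample sizes suggest) and that the two studies with an effect share the same effect size β while the other three have none. The symbols $\hat{\beta}_i$, $w_i$ and $\hat{\beta}_{\text{pooled}}$ are introduced here for illustration and do not appear in the original text.

```latex
\hat{\beta}_{\text{pooled}}
  = \frac{\sum_{i=1}^{5} w_i \hat{\beta}_i}{\sum_{i=1}^{5} w_i}
  \approx \frac{\beta + \beta + 0 + 0 + 0}{5}
  = \frac{2}{5}\beta
  \qquad \text{when } w_1 = \dots = w_5 .
```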

6.8.2 Power and type I error comparison between Meta-GxE approach and traditional Wald test based GxE testing approach

To compare the power and type I error (false positive rate) between the Meta-GxE approach and the traditional Wald test based approach to identify gene-by-environment interactions, we first used each approach to identify both environment-shared and environment-specific effects on the phenotype in simulated data sets. For this power comparison, we generated 6 simulated genotype data sets with 1000 individuals assuming a minor allele frequency of 0.3 and we simulated the phenotype using equation (3). For all 6 simulated studies, we assume that there is no environment-shared effect ($\beta = 0$). We also assume that one study has no gene-by-environment effect ($\gamma_i = 0$) and some of the remaining 5 studies have gene-by-environment interactions ($\gamma_i \neq 0$). Supplementary Figures S1 and S2 show the power comparison of the RE approach vs the traditional Wald test based approach and that of the HE approach vs the traditional Wald test based approach. The proposed RE and HE approaches consistently outperform the traditional Wald test based testing framework. Supplementary Tables S1 and S2 show the false positive rate comparison of the RE approach vs the traditional Wald test based approach and that of the HE approach vs the traditional Wald test based approach. Both the RE and HE approaches correctly control the false positive rate.

6.8.3 Details of 17 HDL mouse studies

Here we provide details of each of the mouse genetic studies involved in the meta-analysis.

6.8.3.1 BxH-Apoe(M/F)

C57BL/6J Apoe−/− (B6.Apoe−/−) mice were purchased from Jackson Laboratory (Bar Harbor, Maine, United States). C3H/HeJ Apoe−/− (C3H.Apoe−/−) mice were generated by backcrossing B6.Apoe−/− to C3H for ten generations. F1 mice were generated from reciprocal intercrossing between B6.Apoe−/− and C3H.Apoe−/−, and F2 mice were subsequently bred by intercrossing F1 mice. A total of 334 F2 mice (169 female, 165 male) were produced. All mice were fed Purina Chow containing 4% fat until 8 wk of age and then transferred to a "Western" diet containing 42% fat and 0.17% cholesterol for 16 wk. Mice were euthanized and plasma collected at 24 wk of age [WYS06].

6.8.3.2 BxD-db-5(M/F) and BxD-db-12(M/F)

We previously carried out an F2 intercross between the inbred strains DBA/2 and C57BL/6 [DNC12]. These parental mice were obtained from The Jackson Laboratory (Bar Harbor, ME). The male C57BL/6 parents carried a heterozygous deficiency in the leptin receptor (db +/-) and F1 progeny were selected for homozygosity of the mutant allele. Among F2 progeny, only those with homozygous deficiency in the leptin receptor (db/db) were selected. Animals were fed a chow diet with 6% fat by weight. Plasma levels of HDL were measured in both sexes at 5 weeks of age (BxD-db-5(M/F)) and at 12 weeks of age (BxD-db-12(M/F)).

6.8.3.3 HMDP-chow(M)

Male mice from the hybrid HMDP panel were purchased from The Jackson Laboratory and studied as previously described [BFO10b]. Mice were between 6 and 10 wk of age on arrival and, to ensure adequate acclimatization to a common environment, the mice were aged until 16 wk of age. All mice were maintained on a chow diet (Ralston-Purina Co.) until sacrifice at 16 wk of age. Following a 16-h fast, mice were bled retro-orbitally under isoflurane anesthesia.

6.8.3.4 HMDP-fat(M/F)

All mice were obtained from The Jackson Laboratory and were bred at University of California, Los Angeles to generate mice used in this study as previously described [PNO13]. Purchased mice were maintained on a chow diet (Ralston Purina Company) until 8 weeks of age. Then they were given a high-fat, high-sucrose diet (Research Diets D12266B) with the following composition: 16.8% kcal protein, 51.4% kcal carbohydrate, and 31.8% kcal fat for 8 weeks. Plasma was collected at 16 weeks of age.

6.8.3.5 HMDPxB-chow(M/F) and HMDPxB-ath(M/F)

Mice for the HMDPxB panel were created by breeding females of the various HMDP inbred strains [BFO10b] to males carrying transgenes for both Apoe Leiden [MHK93] and for human Cholesterol Ester Transfer Protein (CETP) [JAW92] on a C57BL/6 genetic background. Male and female F1 progeny were genotyped to verify presence of both transgenes and then provided a normal chow diet until 8 weeks of age. At that time, after a 4 h fast, animals were anesthetized with isoflurane vapor and a sample of blood (HMDPxB-chow) was collected via the retro-orbital sinus and plasma was prepared using EDTA-coated tubes (Cat. 365973, Becton Dickinson, Franklin Lakes, NJ). These plasmas represent the HMDPxB-chow(M) (male) and HMDPxB-chow(F) (female) populations. The animals were then fed a high-fat, high-cholesterol diet (33 kcal% fat (mostly cocoa butter) + 1% cholesterol) (Research Diets, New Brunswick, NJ) for 16 weeks, at which time a second blood sample was collected by the same procedures. These plasmas represent the HMDPxB-ath(M) (male) and HMDPxB-ath(F) (female) populations. All mice were maintained on a 12 h light and 12 h dark cycle and fed ad libitum.

6.8.3.6 CxB-ldlr(M/F)

Female BALB/cByJ-LDLRKO (designated as C) mice were crossed with male C57BL/6J-LDLRKO (designated as B) to generate F1 mice. Intercross of F1 was performed to generate F2 mice. A total of 124 male and 64 female F2 mice were phenotyped and genotyped. These F2 mice were fed a Western diet (Harlan Teklad) for 12 weeks before euthanasia and collection of plasma. For genotyping, the Illumina 1440 SNP Mouse medium density microarray was used with 799 informative SNPs. Two additional SNPs at 30 Mb and 67 Mb of chromosome 2 were also typed manually. Therefore a total of 801 markers were used. The average marker density was 1 SNP per 2.2 Mb (Unpublished).

6.8.3.7 BxH-wt(M/F)

BXH wild type (BXH/wt) mice were produced as previously described [NIS10]. Briefly, C57BL/6J mice were intercrossed with C3H/HeJ mice to generate 321 F2 progeny (161 females, 160 males). These mice were fed Purina Chow (Ralston-Purina) containing 4% fat until 8 wk of age, after which they were fed a Western diet (Teklad 88137, Harlan Teklad) containing 42% fat and 0.17% cholesterol. At 20 weeks of age, plasma samples were collected as described above. All mice were maintained on a 12 h light and 12 h dark cycle and fed ad libitum.

Reference to published article

Eun Yong Kang*, Buhm Han*, Nicholas Furlotte*, Jong Wha Joo, Diana Shih, Richard Davis, Jake Lusis, Eleazar Eskin. “Meta-analysis identifies gene-by-environment interactions as demonstrated in a study of 4,323 mouse samples”, in PLOS Genetics, 2013.

CHAPTER 7

Genome Wide Association Study for Age-related Hearing Loss in the Mouse: A Meta-analysis

7.1 Introduction

Age-related hearing impairment (ARHI) is characterized by a symmetric sensorineural hearing loss primarily in the high frequencies. Age of onset, progression and severity of ARHI show great variation, but males are generally more affected than females. Approximately 35% of people over the age of 65, and 50% of octogenarians, suffer from ARHI [GM05].

Heritability studies have shown that the sources of this variance are both genetic and environmental, with approximately half of the variance attributable to hereditary factors [HT10]. Only a limited number of large-scale genome-wide association studies (GWAS) for ARHI have been undertaken in humans to date. A first pooled genome-wide association study for ARHI was carried out in a population consisting of eight subpopulations from 6 European countries. An association was found between ARHI and SNPs within the GRM7 gene. This gene encodes the metabotropic glutamate receptor type 7, which is activated through L-glutamate, the primary excitatory neurotransmitter in the auditory system [FVH09]. This study was followed up in a US population and an association was found with GRM7 and ARHI on several measures of central auditory function (DINA). A GWAS in 352 samples from the Saami, an isolated population originating from northern Finland, revealed no genome-wide significant associations for ARHI, although the authors noted one SNP immediately downstream of the GRM7 gene among the most significant association signals [VHH10].

As delineated above for ARHI, the genetic analysis of human complex traits has been revolutionized by the ability to carry out association studies on a genome-wide basis. Such GWAS have been applied to numerous complex traits [MCC09, ADL08]. Despite these successes, the fraction of the genetic component that has been explained is relatively modest for most traits. Furthermore, formal proof that a specific variant is responsible for a given trait has proven difficult. Buoyed by the prospects and successes of human association studies, several groups have proposed mouse GWAS [BFO10b, GRF12, VSG06, YNB10, CAA04, FE12, MF13, KKW10a]. Mouse models have several advantages over human studies. The environment can be more carefully controlled, measurements can be replicated in genetically identical animals, and the proportion of the variability explained by genetic variation is increased. Complex traits in mouse strains have been shown to have higher heritability, and genetic loci often have stronger effects on the trait compared to humans [LWD00, WPB03, YWF04]. Furthermore, several recently developed strategies for mouse genetic studies, such as use of the Hybrid Mouse Diversity Panel (HMDP), provide much higher resolution for associated loci than traditional approaches to quantitative trait loci (QTL) mapping [BFO10b]. In genome-wide association studies in humans, the use of meta-analysis is becoming more and more popular because one can effectively collect tens of thousands of individuals, which provides power to identify associated variants with small effect sizes [HMK13, AWG13, BSJ13].

Motivated by the success in human studies, in this chapter, we describe the use of a meta-analytic approach based upon a random effects model to identify loci for ARHI in the mouse by combining data sets from several studies. We combine inbred strains within the HMDP, a QTL study mapping ahl8, and auditory evoked potential data on common inbred strains (JAX). Utilizing a framework that facilitates the interpretation of the results of this meta-analysis, we are able to distinguish, for each locus, between the data sets predicted to have an effect, the data sets predicted not to have an effect, and the ambiguous studies that were under-powered [HE12]. Several loci that we have identified with our combined analysis were not previously evident in any of the individual studies.

7.2 Results

7.2.1 New loci discovered through random-effects meta-analysis

We combined heterogeneous phenotypic data sets (auditory evoked potential (ABR) thresholds for 8 kHz, 16 kHz, and 32 kHz) including 226 classic inbred strains (Zheng M and F), 387 F2 backcross mice (M and F) from the ahl8 mapping study [JLG08], and 324 mice from our HMDP panel. The ages of mice at the time of testing were different in each study. Among these data sets, the Zheng data set phenotyped animals at different ages which, as we show in our analysis, introduces confounding factors that likely cause spurious associations. Application of the random effect meta-analysis approach requires that we first compute the effect size and its standard deviation for each of the 5 studies using linear mixed model association mapping [HE11b, FKV12]. This strategy corrects for population structure underlying the data sets. Supplementary Figures S1 to S3 show the Manhattan plots of the 5 data sets for three hearing phenotypes (8 kHz, 16 kHz, and 32 kHz tone bursts). As seen in these figures, the Zheng data yield many association peaks, likely false positives secondary to insufficient power. Thus we excluded the Zheng data from the meta-analysis. We applied meta-analysis on the remaining three data sets (Johnson(F), Johnson(M), and HMDP(F)). Figure 7.1 shows the Manhattan plot combining these data for the three hearing phenotypes (8 kHz, 16 kHz, 32 kHz). Table 7.1 summarizes significant peaks from our meta-analysis for the three hearing phenotypes (8 kHz, 16 kHz, 32 kHz). The combined meta p-value for the ahl8 locus is very significant at all frequencies tested.

The effect sizes, standard errors and p-values for the chromosome 11 locus are shown as forest plots in Figures 7.2 (a) - 7.4 (a). In each forest plot, the size of the black square and the length of the horizontal line for each study represent the precision of the effect size estimate and the standard error of the effect size estimate, respectively. The diamond represents the summarized log odds ratio using all studies. While we excluded the Zheng data due to possible confounding, we include it in the forest plots to aid in interpretation of the significant peaks. Similar effect sizes in the Zheng data compared to the other studies for the chromosome 11 peak suggest that even in the presence of confounding factors, the effect is present in the Zheng data.
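The summary row of a forest plot combines the per-study effect sizes and their standard errors. As a minimal sketch, assuming the conventional inverse-variance (fixed-effects) weighting together with Cochran's Q heterogeneity statistic, and not the random-effects statistic used in this chapter, the combination could be computed as follows. The function names and the input values are hypothetical and serve only as an illustration.

```python
import math


def inverse_variance_summary(betas, ses):
    """Combine per-study effect sizes with inverse-variance weights.

    betas: per-study effect size estimates (e.g. log odds ratios)
    ses:   per-study standard errors of those estimates
    Returns the pooled effect size and its standard error.
    """
    weights = [1.0 / (se * se) for se in ses]
    total_weight = sum(weights)
    pooled = sum(w * b for w, b in zip(weights, betas)) / total_weight
    pooled_se = math.sqrt(1.0 / total_weight)
    return pooled, pooled_se


def cochran_q(betas, ses):
    """Cochran's Q: weighted squared deviations from the pooled estimate."""
    pooled, _ = inverse_variance_summary(betas, ses)
    return sum(((b - pooled) / se) ** 2 for b, se in zip(betas, ses))


# Hypothetical effect sizes and standard errors for three studies.
betas = [0.50, 0.45, 0.60]
ses = [0.10, 0.12, 0.20]
pooled, pooled_se = inverse_variance_summary(betas, ses)
print(f"pooled effect = {pooled:.3f}, SE = {pooled_se:.3f}")
print(f"Cochran's Q = {cochran_q(betas, ses):.3f}")
```

Studies with smaller standard errors receive larger weights, which is why they are drawn with larger squares in a forest plot. A large Q relative to the number of studies would hint at heterogeneity between studies.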

7.2.2 Interpreting meta-analysis using posterior probabilities

As seen in the tables and figures, the presence of the effect may not be reflected in the study-specific p-value due to a lack of statistical power. Therefore, it is difficult to distinguish whether an effect is absent in a particular study because of a gene-by-environment interaction at the locus or because of a lack of power. In order to identify which studies have effects, we utilize a statistic called the m-value [HE12], which estimates the posterior probability of an effect being present in a study given the observations from all other studies. We visualize the results through a PM-plot, in which p-values (y-axis) are simultaneously visualized with the m-values (x-axis) at each tested locus. These plots allow us to identify in which studies genetic variation at the locus has an effect and in which it does not. M-values for a given variant have the following interpretation: a study with a small m-value (<= 0.1) is predicted not to be affected by the variant, while a study with a large m-value (>= 0.9) is predicted to be affected by the variant. Whether or not an effect is present for a study with an m-value between 0.1 and 0.9 is ambiguous. The m-values are also displayed by the study name near the forest plot, where the colored dot on the left-hand side of the study name provides information about the m-value. A red dot indicates that the study's m-value is greater than 0.9, a blue dot indicates that the study's m-value is less than 0.1, and a green dot indicates that the study's m-value is between 0.1 and 0.9.

[Panels (a), (b), (c)]

Figure 7.1: Mouse hearing association results from random effect meta analysis

(a) 8 kHz

SNP Location       Meta P          # of Studies w/ Significant P   # Studies E/A/N
Chr1:196270162     1.05×10^−9      1                               2/3/0
Chr8:130247700     3.92×10^−6      0                               2/2/1
Chr11:120818214    3.08×10^−11     1                               2/3/0

(b) 16 kHz

SNP Location       Meta P          # of Studies w/ Significant P   # Studies E/A/N
Chr1:196270162     5.97×10^−7      0                               3/2/0
Chr8:130247700     1.54×10^−7      0                               2/2/1
Chr9:30287464      5.25×10^−7      0                               3/2/0
Chr11:120818214    3.32×10^−15     1                               5/0/0

(c) 32 kHz

SNP Location       Meta P          # of Studies w/ Significant P   # Studies E/A/N
Chr1:196270162     3.23×10^−7      0                               4/1/0
Chr8:131366159     3.59×10^−6      0                               3/2/0
Chr11:120818214    1.06×10^−12     1                               3/2/0
Chr18:24320393     2.32×10^−6      0                               3/2/0

Table 7.1: (a) Three significant loci were identified in the meta-analysis for the 8 kHz tone burst. (b) Four significant loci were identified in the meta-analysis for the 16 kHz tone burst. (c) Four significant loci were identified in the meta-analysis for the 32 kHz tone burst. E denotes the number of studies with an effect on the phenotype. N denotes the number of studies without an effect. A denotes the number of studies with an ambiguous effect.

The PM-plots for the chromosome 11 locus for each tested frequency are shown in Figures 7.2 (b) - 7.4 (b). If we only look at the separate study p-values (y-axis), we can conclude that this locus only has an effect for 8 kHz in the Johnson (F) cohort. However, if we look at the m-values (x-axis), then we find 2 studies (Johnson (F) and Johnson (M)) where we predict that the variation has an effect, while in the other 3 studies we predict there is no effect. For the 16 kHz trait, only the Johnson (M) and (F) cohorts have an effect individually, but if we consider the m-values, all 5 demonstrate an effect at this locus. Lastly, for the 32 kHz trait, only the Johnson (M) cohort has an individual effect, but taken in aggregate three studies (Johnson (M and F) and HMDP (F)) have an effect at this locus.

In Table 7.1 (a)-(c), the column E/A/N summarizes the m-values for each associated peak, providing the counts of the number of studies where there is a predicted effect (E), the number of studies where the effect is ambiguous (A) and the number of studies where we predict that the effect is not present (N). A comparison of this column with the column providing the count of studies with a significant p-value shows that the m-value predicts more studies to have an effect than the p-value alone. This is consistent with the fact that each individual study is underpowered.
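The E/A/N tally follows directly from the m-value thresholds stated in Section 7.2.2 (at least 0.9 for a predicted effect, at most 0.1 for no effect, and ambiguous in between). A minimal sketch, assuming the m-values have already been computed elsewhere; the function names and the example m-values are hypothetical and are not taken from the studies analyzed here:

```python
def classify_m_value(m):
    """Map an m-value to 'E' (effect), 'N' (no effect) or 'A' (ambiguous)."""
    if not 0.0 <= m <= 1.0:
        raise ValueError("an m-value is a probability and must lie in [0, 1]")
    if m >= 0.9:
        return "E"
    if m <= 0.1:
        return "N"
    return "A"


def ean_summary(m_values):
    """Return the E/A/N counts for one locus as a string such as '2/2/1'."""
    counts = {"E": 0, "A": 0, "N": 0}
    for m in m_values:
        counts[classify_m_value(m)] += 1
    return "{E}/{A}/{N}".format(**counts)


# Hypothetical m-values for five studies at a single locus.
example_m_values = [0.97, 0.93, 0.55, 0.30, 0.04]
print(ean_summary(example_m_values))
```

Keeping the classification separate from the tally makes it easy to reuse the same thresholds when coloring the dots of a PM-plot.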

Through the forest plot, we can easily see which studies show a strong effect on the hearing phenotype at this locus and which do not. One interesting observation is that for the 16 kHz hearing phenotype in Figure 7.3, the log odds ratios of all 5 studies are very close (around 0.5). This is strong evidence that the effect of the locus Chr11:120818214 on mouse hearing ability is likely to be a real biological signal. For the two other hearing phenotypes (8 kHz and 32 kHz), shown in Figures 7.2 and 7.4, the forest plots show weak effect sizes in the two Johnson studies. Supplementary Figures 7.8 to 7.18 show the forest and PM plots for all of the discovered loci for each of the phenotypes.

[Forest plot (x-axis: log odds ratio) and PM-plot (x-axis: m-value; y-axis: −log10(p)) for the Zheng(F), Zheng(M), Johnson(F), Johnson(M) and HMDP(F) studies; Chr11:120818214 (Meta P = 3.08 × 10^−11)]

Figure 7.2: (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (8 kHz)

7.2.3 Novel loci identified in meta-analysis

Our approach discovered an additional four loci for ARHI in mice. What is evident from the data in Table 7.1 is the power of the meta-analysis approach to identify additional loci in the combined data set that would have gone unrecognized in the individual studies due to insufficient power. As seen in Table 1, all but one of the cohorts in the meta-analysis detected SNP associations with the 8 kHz trait, whereas in the individual cohorts only two of the five were significant. For the 16 kHz and 32 kHz traits, only one study was powered to detect an association at the Ahl8 locus by itself, whereas 3 studies showed association in the meta-analysis. The other loci would have gone unrecognized without the combined analysis.

Figure 7.3: (A) Forest plot (x-axis: log odds ratio) and (B) PM-plot (x-axis: m-value; y-axis: −log10(p)) for Chr11:120818214 locus (16 kHz); meta-analysis P = 3.32 × 10^−15.

Figure 7.4: (A) Forest plot (x-axis: log odds ratio) and (B) PM-plot (x-axis: m-value; y-axis: −log10(p)) for Chr11:120818214 locus (32 kHz); meta-analysis P = 1.06 × 10^−12.

Lastly, a look at the meta-analysis data reveals that there are three loci associated with the 8 kHz trait on chromosomes 1, 8, and 11. However, for the 16 and 32 kHz studies, there are additional loci on chromosomes 9 and 18, respectively. This underscores the complexity of the hearing phenotype, its polygenic nature, and the likely effects of genetic variation on specific regions of the cochlea. The identification of an additional locus for the 32 kHz trait is also interesting in light of the fact that the Fscn2 BAC transgenic rescue of the ahl8-associated hearing loss in DBA/2J was successful at all frequencies except 32 kHz. It is possible that a locus on chromosome 18 may contribute to DBA/2J hearing loss. Equally plausible, given the size of the BAC transgenic, there may be additional genes in the ahl8 interval contributing to the phenotype.

The resolution in mouse association studies is limited by the size of linkage disequilibrium blocks. For the loci reported here, these blocks span roughly 2 Mb (consistent with average LD blocks in previous HMDP studies [DNB13, BFO10b]) and contain a number of potential candidates including membrane transporter genes, transcription factors, and neurotransmitter receptors. Many of the genes within the intervals are expressed in the cochlea (data not shown), complicating candidate selection to some degree; however, the gene lists are relatively small and serve as a starting point for gene discovery. Snhl2, an ARHI locus identified in a backcross between ALR/LtJ and C3HeB/FeJ and mapped to chromosome 1 (133 Mb-172 Mb), is nearby, and these data may serve to refine this locus or may represent an additional locus [LNN11]. The two loci, Hfhl3 on chromosome 9 and ahl6 on chromosome 18, map closely to the regions identified in our meta-analysis [KN12, DN06]. In contrast, although Hfhl2 maps to chromosome 8, it is very proximal to the locus identified in our study and likely represents a novel locus [KNL11].

7.3 Discussion

In this chapter, we used a recently developed meta-analysis approach that can be applied to a large number of heterogeneous studies, each conducted in different environments with animals from different genetic backgrounds, different genders and different ages at the time of testing. We show the practical utility of the proposed method by applying it to 5 mouse ARHI studies containing 937 samples, and we successfully identify several loci involved in ARHI in the mouse, including the known locus, ahl8. Consistent with the results of meta-analysis in human studies, our combined study recognized loci that were not discovered in any of the individual studies.

Part of the reason for our success in identifying several loci is that our study combined multiple mouse studies using fundamentally different mapping strategies. Over the past few years, many new strategies have been proposed beyond the traditional F2 cross [FE12], including the hybrid mouse diversity panel (HMDP) [BFO10b, GRF12], heterogeneous outbred stocks [VSG06], commercially available outbred mice [YNB10], and the collaborative cross [CAA04]. In our current study, we are combining data from two HMDP strategies with an N2 backcross. The meta-analysis benefits from the statistical power and resolution advantages of this combination [FKV12].

As shown in Table 7.2, there are currently 17 autosomal loci and one mitochondrial locus for ARHI in mice (www.hearingimpairment.jax.org). Of these 18 loci, only 6 have been characterized at the gene level, illustrating the inherent difficulties in traditional gene mapping studies. Much of the progress in the genetics of hearing disorders in the mouse has come from the application of linkage analysis (i.e. QTL analysis) to identify naturally occurring single gene mutations (Mendelian traits) and the analysis of targeted gene deletions. Little attention has been directed towards the definition of the genetics of common hearing disorders. Classical genetic approaches have been used to identify several QTLs that are associated with ARHI in mice. However, one of the most significant shortcomings of QTL analysis is the use of a limited resource (i.e. a segregating F2 population). Another limitation of this approach is the genomic resolution, typically on the order of 10 Mb or greater. Hence, further progress with these strategies depends upon the execution of large-scale fine-mapping crosses wherein several thousand N2 or F2 animals are screened for recombination within the region of interest. Following this, several generations of subcongenic lines must be established, followed by progeny testing. This process can take years of effort and may never achieve the resolution necessary to identify individual genes or causal variants. In this chapter we have demonstrated the utility of GWAS using a meta-analysis approach for the high-resolution mapping of loci associated with ARHI in the mouse. Subsequent studies will include a replication of this experiment utilizing the entire HMDP and the selection of candidate genes based upon cochlear eQTL information, cochlear gene expression, and genes known to be associated with hearing loss in either mouse or humans. Since the mouse inner ear is functionally and genetically very similar to the human ear, these studies will facilitate the identification of genes and pathways associated with ARHI in humans.

7.4 Methods and Materials

7.4.1 Controlling for population structure within studies

Model organisms such as the mouse are well known to exhibit population structure or cryptic relatedness [DRB01, VP05], in which genetic similarities between individuals both inhibit the ability to find true associations and cause the appearance of a large number of false or spurious associations. Mixed effects models are often used to correct for this problem [Lan02, YPB06, KZW08]. Methods employing a mixed effects correction account for the genetic similarity between individuals by introducing a random variable into the traditional linear model.

$$Y_i = \mu + \delta_i X_r + u_i + \epsilon \tag{7.1}$$

In the model in equation (7.1), the random variable $u_i$ represents the vector of genetic contributions to the phenotype for individuals in population $i$. This random variable is assumed to follow a normal distribution with $u_i \sim N(0, \sigma_g^2 K_i)$, where $K_i$ is the $n_i \times n_i$ kinship coefficient matrix for population $i$. With this assumption, the total variance of $Y_i$ is given by $\Sigma_i = \sigma_g^2 K_i + \sigma_e^2 I$. A z-score statistic is derived for the test $\delta_i = 0$ by noting the distribution of the estimate $\hat{\delta}_i$. In order to avoid complicated notation, we introduce a more basic matrix form of the model in equation (7.1), shown in equation (7.2).

$$Y_i = S_i \Gamma + u_i + \epsilon \tag{7.2}$$

In equation (7.2), $S_i$ is an $n_i \times 2$ matrix encoding the global mean and SNP vectors and $\Gamma$ is a $2 \times 1$ coefficient vector. We note that this form also easily extends to models with multiple covariates. The maximum likelihood estimate for $\Gamma$ in population $i$ is given by $\hat{\Gamma}_i = (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} Y_i$, which follows a normal distribution with mean equal to the true $\Gamma$ and variance $(S_i' \Sigma_i^{-1} S_i)^{-1}$. The estimate of the effect size $\delta_i$ and its standard error $SE(\delta_i)$ are then given in equations (7.3) and (7.4), where $R = [0 \; 1]$ is a vector used to select the appropriate entry in the vector $\hat{\Gamma}_i$.

$$\delta_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} Y_i \tag{7.3}$$

$$SE(\delta_i) = \left[ R (S_i' \Sigma_i^{-1} S_i)^{-1} R' \right]^{1/2} \tag{7.4}$$

This mixed-model-based approach, which is well known for correcting for population structure, was applied to the studies reported here.
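The generalized least squares estimate of the SNP effect and its standard error can be illustrated with the minimal sketch below. It assumes that the variance components are already known, whereas in practice they would be estimated from the data; the function names `solve` and `mixed_model_effect` are hypothetical. Only the standard library is used so that the example is self-contained, and a linear algebra library would normally replace the hand-written solver.

```python
import math


def solve(A, B):
    """Solve A X = B by Gaussian elimination with partial pivoting.

    A is an n x n list of lists, B an n x k list of lists; returns X (n x k).
    """
    n, k = len(A), len(B[0])
    M = [list(A[r]) + list(B[r]) for r in range(n)]  # augmented matrix [A | B]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(M[r][col]))
        M[col], M[pivot] = M[pivot], M[col]
        for r in range(col + 1, n):
            f = M[r][col] / M[col][col]
            for c in range(col, n + k):
                M[r][c] -= f * M[col][c]
    X = [[0.0] * k for _ in range(n)]
    for r in range(n - 1, -1, -1):  # back substitution
        for j in range(k):
            s = M[r][n + j] - sum(M[r][c] * X[c][j] for c in range(r + 1, n))
            X[r][j] = s / M[r][r]
    return X


def mixed_model_effect(y, x, K, sigma_g2, sigma_e2):
    """Return (effect estimate, standard error) for one SNP vector x.

    Assumes known variance components sigma_g2 and sigma_e2 (an assumption of
    this sketch) and a design matrix S = [1, x].
    """
    n = len(y)
    # Total covariance: Sigma = sigma_g^2 K + sigma_e^2 I
    Sigma = [[sigma_g2 * K[a][b] + (sigma_e2 if a == b else 0.0)
              for b in range(n)] for a in range(n)]
    # Z holds Sigma^{-1} applied to the columns [1, x, y].
    Z = solve(Sigma, [[1.0, x[a], y[a]] for a in range(n)])
    # Entries of A = S' Sigma^{-1} S (2 x 2) and b = S' Sigma^{-1} Y (2 x 1).
    a11 = sum(Z[a][0] for a in range(n))
    a12 = sum(Z[a][1] for a in range(n))
    a22 = sum(x[a] * Z[a][1] for a in range(n))
    b1 = sum(Z[a][2] for a in range(n))
    b2 = sum(x[a] * Z[a][2] for a in range(n))
    det = a11 * a22 - a12 * a12
    delta = (a11 * b2 - a12 * b1) / det  # second entry of A^{-1} b
    se = math.sqrt(a11 / det)            # square root of entry (2, 2) of A^{-1}
    return delta, se
```

With no genetic variance (`sigma_g2 = 0`) the estimate reduces to ordinary least squares, which gives a simple sanity check for the sketch.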

7.4.1.1 Random effects model meta analysis

Under the random effects model meta-analysis, we explicitly model heterogeneity by assuming a hierarchical model. We assume that the effect size of each study, $\delta_i$, is a random variable drawn from a distribution with grand mean $\mu$ and variance $\tau^2$,

$$\delta_i \sim N(\mu, \tau^2)$$

We recently developed a powerful random effects model, which addresses the conservative nature of the traditional random effects model by assuming no heterogeneity under the null hypothesis [HE11b]. This modification is natural because the effect size should be fixed at zero under the null hypothesis. This random effects model tests the null hypothesis $\mu = 0$ and $\tau^2 = 0$ versus the alternative hypothesis $\mu \neq 0$ or $\tau^2 \neq 0$.

Similarly to the traditional random effects model [DL86a], we use the likelihood ratio framework, considering each statistic as a single observation. Since we assume no heterogeneity under the null, $\mu = 0$ and $\tau^2 = 0$ under the null hypothesis. The likelihoods are then

$$L_0 = \prod_i \frac{1}{\sqrt{2\pi V_i}} \exp\left( -\frac{\delta_i^2}{2 V_i} \right)$$

$$L_1 = \prod_i \frac{1}{\sqrt{2\pi (V_i + \tau^2)}} \exp\left( -\frac{(\delta_i - \mu)^2}{2 (V_i + \tau^2)} \right).$$

The maximum likelihood estimates $\hat{\mu}$ and $\hat{\tau}^2$ can be found by an iterative procedure suggested by Hardy and Thompson [HT96]. The likelihood ratio test statistic can then be built as

$$S_{meta} = -2 \log(\lambda) = \sum_i \log\left( \frac{V_i}{V_i + \hat{\tau}^2} \right) + \sum_i \frac{\delta_i^2}{V_i} - \sum_i \frac{(\delta_i - \hat{\mu})^2}{V_i + \hat{\tau}^2}, \tag{7.5}$$

where $\delta_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} Y_i$ and $V_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} R'$. P-values can be easily computed using precomputed tabulated values [HE11b].
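The statistic in equation (7.5) can be sketched as follows. This is an illustrative sketch rather than the implementation used in this work: the fixed-point update for the variance, truncated at zero, is one simple way to solve the likelihood equations, and the function name `re2_statistic` is hypothetical. The p-value lookup from tabulated values is not reproduced here.

```python
import math


def re2_statistic(deltas, variances, max_iter=1000, tol=1e-12):
    """Return (S_meta, mu_hat, tau2_hat) for the statistic in equation (7.5)."""

    def mu_given(tau2):
        # Weighted mean of the effect sizes with weights 1 / (V_i + tau^2).
        w = [1.0 / (v + tau2) for v in variances]
        return sum(wi * d for wi, d in zip(w, deltas)) / sum(w)

    def statistic(mu, tau2):
        # Term-by-term evaluation of equation (7.5).
        return sum(
            math.log(v / (v + tau2)) + d * d / v - (d - mu) ** 2 / (v + tau2)
            for d, v in zip(deltas, variances)
        )

    tau2 = 0.0
    for _ in range(max_iter):
        mu = mu_given(tau2)
        w = [1.0 / (v + tau2) for v in variances]
        num = sum(wi * wi * ((d - mu) ** 2 - v)
                  for wi, d, v in zip(w, deltas, variances))
        new_tau2 = max(0.0, num / sum(wi * wi for wi in w))  # truncate at zero
        converged = abs(new_tau2 - tau2) < tol
        tau2 = new_tau2
        if converged:
            break

    # The statistic is a maximum over (mu, tau^2); tau^2 = 0 is kept as a guard.
    candidates = [(statistic(mu_given(t), t), mu_given(t), t) for t in (tau2, 0.0)]
    return max(candidates)
```

When all studies report the same effect size, the variance estimate is zero and the statistic reduces to the squared fixed effects z-score.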

Identifying studies with an effect

After identifying loci exhibiting interaction effects, we employ the meta-analysis interpretation framework that we recently developed. The m-value [HE12] is the posterior probability that the effect exists in each study. Suppose we have $N$ studies that we want to combine. Let $D = [\delta_1, \delta_2, \ldots, \delta_N]$ be the vector of estimated effect sizes and $V = [V_1, V_2, \ldots, V_N]$ be the vector of the estimated variances of the $N$ effect sizes. We assume that the effect size $\delta_i$ follows a normal distribution,

$$P(\delta_i \mid \text{no effect}) = N(\delta_i; 0, V_i) \tag{7.6}$$

$$P(\delta_i \mid \text{effect}) = N(\delta_i; \mu, V_i) \tag{7.7}$$

The normality assumption above is justified by the large sample sizes of current GWASs. We assume that the prior for the effect size is

$$\mu \sim N(0, \sigma^2) \tag{7.8}$$

A possible choice for $\sigma$ in GWAS is 0.2 for small effects and 0.4 for large effects [SB09].

We denote by $E_i$ a random variable whose value is 1 if study $i$ has an effect and 0 otherwise, and by $E$ the vector of $E_i$ for the $N$ studies. Since $E$ has $N$ binary values, $E$ can take $2^N$ possible configurations. Let $U = [e_1, \ldots, e_{2^N}]$ be a vector containing all of these possible configurations. We define the m-value $m_i$ as the probability $P(E_i = 1 \mid D)$, which is the probability of study $i$ having an effect given the estimated effect sizes. We can compute this probability using Bayes' theorem in the following way:

$$m_i = P(E_i = 1 \mid D) = \frac{\sum_{e \in U_i} P(D \mid E = e) P(E = e)}{\sum_{e \in U} P(D \mid E = e) P(E = e)} \tag{7.9}$$

where $U_i$ is the subset of $U$ whose elements' $i$th value is 1. Now we need to compute $P(D \mid E = e)$ and $P(E = e)$. $P(E = e)$ can be computed as

$$P(E = e) = \frac{B(|e| + \alpha, N - |e| + \beta)}{B(\alpha, \beta)} \tag{7.10}$$

where $|e|$ denotes the number of 1's in $e$, $B$ denotes the beta function, and we set $\alpha$ and $\beta$ to 1 [HE12]. The probability of $D$ given configuration $e$, $P(D \mid E = e)$, can be computed as

$$P(D \mid E = e) = \int_{-\infty}^{\infty} \prod_{i \in t_0} N(\delta_i; 0, V_i) \prod_{i \in t_1} N(\delta_i; \mu, V_i) \, p(\mu) \, d\mu \tag{7.11}$$

$$= \bar{C} \, N(\bar{\delta}; 0, \bar{V} + \sigma^2) \prod_{i \in t_0} N(\delta_i; 0, V_i) \tag{7.12}$$

where $t_0$ and $t_1$ are the sets of studies whose value in $e$ is 0 and 1, respectively, and

$$\bar{\delta} = \frac{\sum_i W_i \delta_i}{\sum_i W_i} \quad \text{and} \quad \bar{V} = \frac{1}{\sum_i W_i} \tag{7.13}$$

Here $N(\delta; a, b)$ denotes the probability density function of the normal distribution with mean $a$ and variance $b$, $W_i = V_i^{-1}$ is the inverse variance or precision, and $\bar{C}$ is a scaling factor,

$$\bar{C} = \frac{1}{(\sqrt{2\pi})^{N-1}} \sqrt{\frac{\prod_i W_i}{\sum_i W_i}} \exp\left\{ -\frac{1}{2} \left( \sum_i W_i \delta_i^2 - \frac{\left( \sum_i W_i \delta_i \right)^2}{\sum_i W_i} \right) \right\} \tag{7.14}$$

All summations and products appearing in $\bar{\delta}$, $\bar{V}$ and $\bar{C}$ are with respect to $i \in t_1$, so that $N$ in equation (7.14) counts only the studies in $t_1$.

The m-values have the following interpretations: small m-values (< 0.1) represent a study that is predicted not to have an effect, large m-values (> 0.9) represent a study that is predicted to have an effect, and for values in between the prediction is ambiguous.

It was previously reported that m-values can accurately distinguish studies having an effect from studies not having an effect [HE12]. For interpreting and understanding the result of the meta-analysis, it is informative to look at the P-value and m-value at the same time. We applied the PM-plot framework [HE12], which plots the P-values and m-values of each study together in two dimensions. For studies with an m-value between 0.1 and 0.9, we cannot make a decision. One reason that studies are ambiguous (0.1 < m-value < 0.9) is that they are underpowered due to small sample size. If the sample size increases, the study can be drawn to either the left or the right side of the plot.
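The m-value computation of equations (7.9) to (7.14) can be sketched by enumerating all configurations, which is feasible for a small number of studies such as the five considered here. This is a minimal illustration under the stated prior choices, not the implementation used in this work; the function names are hypothetical, and the computation is done in log space to avoid underflow.

```python
import math
from itertools import product


def log_normal_pdf(x, mean, var):
    """Log density of a normal distribution with the given mean and variance."""
    return -0.5 * math.log(2.0 * math.pi * var) - (x - mean) ** 2 / (2.0 * var)


def log_beta(a, b):
    """Log of the beta function B(a, b)."""
    return math.lgamma(a) + math.lgamma(b) - math.lgamma(a + b)


def log_p_data_given_config(deltas, variances, config, sigma):
    """log P(D | E = e), following equations (7.11) to (7.14)."""
    t0 = [i for i, e in enumerate(config) if e == 0]
    t1 = [i for i, e in enumerate(config) if e == 1]
    logp = sum(log_normal_pdf(deltas[i], 0.0, variances[i]) for i in t0)
    if not t1:
        return logp  # no study has an effect, so only the t0 product remains
    w = [1.0 / variances[i] for i in t1]
    d = [deltas[i] for i in t1]
    sum_w = sum(w)
    sum_wd = sum(wi * di for wi, di in zip(w, d))
    delta_bar = sum_wd / sum_w
    v_bar = 1.0 / sum_w
    # Log of the scaling factor; the exponent uses the number of studies in t1.
    log_c = (-(len(t1) - 1) * 0.5 * math.log(2.0 * math.pi)
             + 0.5 * (sum(math.log(wi) for wi in w) - math.log(sum_w))
             - 0.5 * (sum(wi * di * di for wi, di in zip(w, d))
                      - sum_wd ** 2 / sum_w))
    return logp + log_c + log_normal_pdf(delta_bar, 0.0, v_bar + sigma ** 2)


def m_values(deltas, variances, sigma=0.2, alpha=1.0, beta=1.0):
    """Posterior probability that each study has an effect (equation (7.9))."""
    n = len(deltas)
    configs = list(product((0, 1), repeat=n))  # all 2^N configurations
    log_post = []
    for e in configs:
        k = sum(e)
        log_prior = log_beta(k + alpha, n - k + beta) - log_beta(alpha, beta)
        log_post.append(
            log_prior + log_p_data_given_config(deltas, variances, e, sigma))
    top = max(log_post)
    weights = [math.exp(lp - top) for lp in log_post]
    total = sum(weights)
    return [sum(wt for wt, e in zip(weights, configs) if e[i] == 1) / total
            for i in range(n)]
```

Because the number of configurations grows as $2^N$, this exhaustive enumeration is only a reasonable choice when the number of studies is small.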

7.5 Supplementary Materials

7.5.1 Mouse hearing association results

Figure 7.5: The association result of three studies for 8 kHz mouse hearing.

Figure 7.6: The association result of three studies for 16 kHz mouse hearing.

Figure 7.7: The association result of three studies for 32 kHz mouse hearing.

7.5.2 PM-plots for significant loci

AHL locus (gene) | Chr | Location | Mapping methods* | Known strains with susceptibility allele | Known strains with resistance allele | References
ahl (Cdh23) | 10 | 60 Mb | backcross, strain haplotypes | Many | Many | [JEC97], [JZE00], [NZJ03]
ahl4 (Cs) | 10 | 120-130 Mb | RI, CS strains, backcross | A/J | C57BL/6J, CAST/Ei | [ZDY09], [JGL12]
ahl5 (Gipc3) | 10 | 81 Mb | backcross, intercross | Black Swiss | CAST/Ei | [DN06], [CLS11]
ahl8 (Fscn2) | 11 | 120 Mb | RI strains, backcross | DBA/2J | C57BL/6J | [JLG08], [SLG10]
Gpr98, G protein coupled receptor 98 | 13 | 81 Mb | backcross | BUB/BnJ | CAST/Ei | [JZW05]
mtDNA (mt-tr) | mitochondria | | backcross | A/J | CAST/Ei | [JZB01]

Table 7.2: Genetic factors that contribute to age-related hearing loss in inbred mouse strains. CS=chromosome substitution; RI=recombinant inbred.

In each of the following figures, panel (A) is a forest plot listing the per-study p-values for Zheng (F), Zheng (M), Johnson (F), Johnson (M) and HMDP (F) together with the RE summary (x-axis: log odds ratio), and panel (B) is a PM-plot (x-axis: m-value; y-axis: −log10(p)) in which each study is marked as having an effect (m > .9), not having an effect (m < .1), or uncertain (.1 < m < .9).

Figure 7.8: (A) Forest plot and (B) PM-plot for Chr1:196270162 locus (8 kHz); meta-analysis P = 1.05 × 10^−9.

Figure 7.9: (A) Forest plot and (B) PM-plot for Chr8:130247700 locus (8 kHz); meta-analysis P = 3.92 × 10^−6.

Figure 7.10: (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (8 kHz); meta-analysis P = 3.08 × 10^−11.

Figure 7.11: (A) Forest plot and (B) PM-plot for Chr1:196270162 locus (16 kHz); meta-analysis P = 5.97 × 10^−7.

Figure 7.12: (A) Forest plot and (B) PM-plot for Chr8:130247700 locus (16 kHz); meta-analysis P = 1.54 × 10^−7.

Figure 7.13: (A) Forest plot and (B) PM-plot for Chr9:30287464 locus (16 kHz); meta-analysis P = 5.25 × 10^−7.

Figure 7.14: (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (16 kHz); meta-analysis P = 3.32 × 10^−15.

Figure 7.15: (A) Forest plot and (B) PM-plot for Chr1:196345862 locus (32 kHz); meta-analysis P = 3.23 × 10^−7.

Figure 7.16: (A) Forest plot and (B) PM-plot for Chr8:131366159 locus (32 kHz); meta-analysis P = 3.59 × 10^−6.

Figure 7.17: (A) Forest plot and (B) PM-plot for Chr11:120818214 locus (32 kHz); meta-analysis P = 1.06 × 10^−12.

Figure 7.18: (A) Forest plot and (B) PM-plot for Chr18:24320393 locus (32 kHz); meta-analysis P = 2.32 × 10^−6.

CHAPTER 8

A Novel Meta-analysis Approach for Genome-wide Association Studies with Sex-specific Effects.

8.1 Background

Genome-wide association studies (GWAS) have successfully identified numerous genetic loci associated with complex human traits. Most GWAS collect both male and female individuals. Recently, there has been increasing attention to sex-specific effects in genome-wide association studies, and several recent studies have observed differences in effect sizes between males and females [KCS13, PHW13, CDL13, BJC12, KCD12, MLM10, FLW12, PMP13, ML12]. In particular, a large-scale study by Randall et al. (2013) applied meta-analysis over 46 studies of anthropometric phenotypes, analyzing males and females separately to estimate effect sizes in each sex, and discovered many loci with sex-specific effects. The discovery of these sexually dimorphic associations may lead to a better understanding of the role of gene-by-sex interaction in disease mechanisms.

The prevalence of sex-specific effects complicates the analysis of association studies consisting of both males and females. The most widely used traditional approach to analyzing such studies involves analyzing all of the samples together and including sex as a covariate in the statistical model [KCS13, PHW13, CDL13, BJC12, KCD12, MLM10, FLW12, PMP13, ML12]. Treating sex as a covariate in the model adjusts for differences in phenotype mean between males and females, but ignores the possibility of sex-specific effects. This naturally raises two questions: how does the presence of sex-specific effects affect the statistical power of GWAS when sex is modeled as a covariate, and what is the best analysis strategy for a single study consisting of males and females?

Unfortunately, standard association study approaches ignore the potential for differences in effect sizes between the two sexes, which may lead to a loss in statistical power. A further complication in the analysis, as we elaborate on below, is that since gene-by-sex effects are prevalent throughout the genome, when testing for association at a given locus, the gene-by-sex effects in the rest of the genome cause what we call a gene-by-sex background effect. Improperly accounting for this effect can also lead to a decrease in statistical power.

In this chapter, we present a meta-analytic approach for the analysis of genome-wide association studies consisting of both males and females. In our meta-analytic approach, called Meta-Sex, males and females are analyzed separately and the results are combined using a random effects meta-analysis approach which explicitly models differences in effect sizes between sexes. Our approach naturally increases power over the traditional sex-as-covariate approach at loci where sex-specific effects are present because of the advantage of the random effects meta-analysis. We also show that meta-analysis methods are more powerful even at loci which do not have sex-specific effects because of the background gene-by-sex effect. We apply our method to the Northern Finland Birth Cohort data and show that our method has increased power over a traditional approach while controlling for false positives.

Our approach is not the first meta-analysis approach proposed for the analysis of sex-specific effects. A previous method, GWAMA, simply adds the two χ² statistics estimated from each sex and performs a two degree of freedom χ² test [MLM10]. We analytically show the connections between several meta-analysis methods, including our random effects meta-analysis, traditional fixed effects meta-analysis and GWAMA. From our analysis, GWAMA loses power compared to either random effects or fixed effects meta-analysis when the effect sizes for males and females are similar. GWAMA also suffers from a loss in power when the numbers of male and female samples are not equal. We also extend GWAMA to handle the scenario where different numbers of males and females are in the study by adding weights to each study, following similar intuitions to inverse variance weighting for fixed effects meta-analysis.
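The two degree of freedom test described above can be sketched in a few lines. This is an illustration of the statistic as described in the text, not the GWAMA software itself, and the function name is hypothetical. The example values are the female and male effect sizes and standard errors reported for rs7406617 (crpres) in Table 8.2.

```python
import math


def sex_differentiated_chi2(beta_f, se_f, beta_m, se_m):
    """Sum the female and male 1-df chi-square statistics and test on 2 df."""
    chi2 = (beta_f / se_f) ** 2 + (beta_m / se_m) ** 2
    # The upper tail probability of a chi-square with 2 df is exp(-x / 2).
    p_value = math.exp(-chi2 / 2.0)
    return chi2, p_value


# Female: -0.251 +/- 0.048, male: -0.006 +/- 0.049 (rs7406617, crpres).
chi2, p = sex_differentiated_chi2(-0.251, 0.048, -0.006, 0.049)
print(chi2, p)
```

Because the statistic does not use the sign of the two estimates, it gains nothing when both sexes point in the same direction, which is consistent with the loss of power noted above for similar effect sizes.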

8.2 Results

8.2.1 Applying meta-analysis to handle sex-specific effects

We refer to our meta-analytic approach to handling sex-specific effects in genome-wide association studies as Meta-Sex. Our approach is as follows. We first separate the male and female individuals and estimate effect sizes and standard errors for each group separately. We then combine these estimates using a meta-analysis method. While we advocate the use of the recently developed random effects meta-analysis designed for association studies (RE2) [HE11b] to combine the studies, we report results using a variety of meta-analysis methods, including the fixed effects (FE) meta-analysis and the traditional random effects meta-analysis (RE) [DL86b, IPE07a, IPE07b, EMI07]. The difference between RE and RE2 is that RE is overly conservative because it assumes that heterogeneity is present under the null hypothesis of no association. While this assumption is reasonable for some problems, in the context of association studies, loci which do not affect a trait have a true effect size of zero in all studies. RE2 is a random effects meta-analysis approach which addresses this issue with the traditional random effects meta-analysis [HE11b]. We then check whether the resulting statistics are inflated because of potential population structure between the male and female individuals, in which case we apply genomic control to correct for this inflation.
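The combination step can be illustrated with the fixed effects (FE) variant, which is the simplest of the meta-analysis methods named above. The sketch below uses standard inverse variance weighting on the sex-specific estimates; the function name is hypothetical, and the RE2 combination advocated in the text is not shown here.

```python
import math


def fixed_effects_meta(betas, ses):
    """Inverse-variance weighted combination of per-group effect estimates.

    Returns (combined beta, combined standard error, z-score, two-sided p-value).
    """
    weights = [1.0 / se ** 2 for se in ses]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p


# Female and male estimates for rs7406617 (crpres) as listed in Table 8.2.
print(fixed_effects_meta([-0.251, -0.006], [0.048, 0.049]))
```

For this SNP the female and male estimates differ strongly, so averaging them dilutes the female signal; this is the situation in which a method that models heterogeneity would be expected to do better than FE.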

To demonstrate the performance of meta-analysis approaches, we applied 5 different approaches, including three meta-analysis methods (FE, RE and RE2) and two traditional approaches (“Covariate” and “Combined”), to the Northern Finland Birth Cohort (NFBC) data. “Covariate” is the approach of combining the male and female data and including sex as a covariate in the statistical model. “Combined” is the approach of simply combining the male and female data before performing association analysis.

8.2.1.1 Meta-analysis increases statistical power in the NFBC data

We applied our meta-analysis approach to the NFBC data. We computed the association statistics and p-values with the 5 different approaches and computed the genomic control λ to assess the inflation factor of each method. Table 8.1 shows the genomic control λ for the 5 approaches. FE shows slight inflation, and RE shows significant deflation of the association statistic, as expected since it is overly conservative. For each method, we apply genomic control to correct for the inflation so that the different approaches can be compared fairly. Since RE shows significant deflation, we drop this method from further analysis.
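As a minimal sketch of the genomic control step, the inflation factor can be taken as the ratio of the median observed statistic to the median of the null distribution. The sketch assumes that the statistics follow a χ² distribution with one degree of freedom under the null, which is an assumption of this example and does not hold for every method compared here; the function names are hypothetical.

```python
import statistics

# Median of the chi-square distribution with 1 degree of freedom.
CHI2_1DF_MEDIAN = 0.4549364


def genomic_control_lambda(chi2_stats):
    """Ratio of the observed median statistic to the expected null median."""
    return statistics.median(chi2_stats) / CHI2_1DF_MEDIAN


def apply_genomic_control(chi2_stats):
    """Divide every statistic by lambda; returns (corrected statistics, lambda)."""
    lam = genomic_control_lambda(chi2_stats)
    return [s / lam for s in chi2_stats], lam
```

After the correction the median of the statistics matches the null median, so methods with different inflation factors can be compared on the same footing.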

Table 8.2 shows the significant associations for NFBC phenotypes with 4 different sex-differentiated approaches (FE, RE2, Covariate and Combined) and 2 sex-specific approaches (male only and female only). After correcting with genomic control, if any p-value from those 6 approaches is significant (threshold: 5 × 10^−7), we include the result in the table. For each SNP, the most significant p-value among the four methods (FE, RE2, Covariate and Combined) is in bold font. The “beta” column shows the effect sizes of the male-only and female-only studies and their estimated standard errors. When sex-specific effects exist, the effect size differences between the male and female studies are large.

Phenotype | FE | RE | RE2 | Covariate | Combined
tgres | 1.016 | 0.785 | 0.995 | 1.001 | 1.002
hdlres | 1.024 | 0.794 | 1.008 | 1.003 | 1.003
ldlres | 1.024 | 0.795 | 1.011 | 1.002 | 1.002
bmires | 1.014 | 0.789 | 0.992 | 0.994 | 0.994
crpres | 1.004 | 0.774 | 0.998 | 0.993 | 0.993
glures | 1.017 | 0.784 | 1.011 | 1.008 | 1.007
insres | 1.011 | 0.781 | 1.006 | 1.005 | 1.004
sysres | 1.023 | 0.796 | 1.012 | 1.006 | 1.005
diares | 1.021 | 0.789 | 1.013 | 1.006 | 1.006
height | 1.042 | 0.812 | 1.028 | 1.003 | 1.003

Table 8.1: The comparison of genomic control λ for 5 approaches across 10 NFBC phenotypes.

For each significant association, we report the “best methods”. We define the best methods as the set of methods whose p-value is less than 2 times the most significant p-value for that association. For example, for the association at Chr11:47226831 in the hdlres phenotype, the most significant p-value is 3.96 × 10^−8. Since the p-value for FE (4.65 × 10^−8) is less than twice that value, we also consider FE as one of the best methods. However, the p-values for Covariate and Combined are larger than twice that value. Thus only FE and RE2 are included in the best methods for this SNP. An interesting observation in the best methods category is that the meta-analysis approaches (FE and RE2) are typically selected as the best methods together, and the standard approaches (Covariate (CV) and Combined (CM)) are likewise typically selected together. To get an overall comparison of the relative power of the approaches, we compare the number of times the meta-analysis approaches are selected as best methods with the number of times the traditional approaches are selected as best methods. For the 6 NFBC phenotypes with significant associations, the FE/RE2 approaches are the best methods for 13 associated loci, whereas the CV/CM approaches are the best methods for only 3 associated loci. For 11 associated loci, all four approaches provide similar p-values.
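The selection rule for the best methods can be written directly from its definition. The function name is hypothetical, and the two examples use the p-values listed for the crpres phenotype in Table 8.2.

```python
def best_methods(p_values, factor=2.0):
    """Return the methods whose p-value is below `factor` times the smallest one."""
    best = min(p_values.values())
    return [name for name, p in p_values.items() if p < factor * best]


# rs7406617 (crpres): only RE2 is within a factor of 2 of the smallest p-value.
print(best_methods({"FE": 0.00011, "RE2": 3.61e-06, "CV": 8.15e-05, "CM": 8.11e-05}))
# ['RE2']

# rs2794520 (crpres): FE is the most significant, and CV and CM are within a factor of 2.
print(best_methods({"FE": 1.60e-23, "RE2": 3.67e-23, "CV": 3.01e-23, "CM": 2.97e-23}))
# ['FE', 'CV', 'CM']
```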

8.2.2 The presence of background GxS interactions explains the advantage of the sex-differentiated meta-analysis approach

To explain this phenomenon, we note that sex-specific effects are present throughout the genome, and we refer to their effect on the phenotype as the gene-by-sex background effect. We hypothesized that the presence or absence of gene-by-sex interactions throughout the genome may lend an advantage to one method over another.

To test this hypothesis, we generate simulated phenotype data using the Northern Finland Birth Cohort (NFBC) as a model population. For each simulated phenotype, we assume a given heritability ($h_g$) and a given GxS heritability ($h_s$) and generate a phenotype by sampling from a multivariate normal distribution with mean zero and covariance equal to $h_g K + h_s K_s + h_I I$, where $K$ represents the kinship matrix for males and females, $K_s$ represents the GxS kinship matrix (see Methods), and $h_I$ is computed as $1 - h_g - h_s$. We insert an effect for a given SNP and then attempt to identify this effect as significant using each method. Power is calculated as the proportion of simulations in which the inserted effect is identified as significant, over a large number of such simulations. We calculate power for a range of $h_g$ and $h_s$ values, and we vary the level of heterogeneity in effect size between males and females.

Phenotype: crpres
Position         rsid        FE        RE2       Covariate  Combined  p(female)  p(male)   beta±SE(female)  beta±SE(male)  Best methods
Chr1:157945440   rs2794520   1.60e-23  3.67e-23  3.01e-23   2.97e-23  1.28e-13   2.93e-11  0.312±0.042      0.287±0.043    FE,CV,CM
Chr12:119873345  rs2650000   1.06e-12  1.80e-12  1.96e-12   1.98e-12  2.79e-06   8.65e-08  0.197±0.042      0.220±0.041    FE,RE2,CV,CM
Chr17:68989647   rs7406617   0.00011   3.61e-06  8.15e-05   8.11e-05  1.66e-07   0.88493   -0.251±0.048     -0.006±0.049   RE2

Phenotype: glures
Position         rsid        FE        RE2       Covariate  Combined  p(female)  p(male)   beta±SE(female)  beta±SE(male)  Best methods
Chr2:169471394   rs560887    2.43e-12  4.06e-12  9.32e-13   9.42e-13  8.19e-06   4.83e-08  0.008±0.002      0.010±0.002    CV,CM
Chr7:15030358    rs10244051  2.14e-07  2.61e-07  1.74e-07   1.77e-07  0.00029    0.00021   -0.007±0.002     -0.007±0.002   FE,RE2,CV,CM
Chr11:92362409   rs1447352   1.31e-07  1.59e-07  9.07e-08   8.72e-08  0.00037    9.46e-05  0.007±0.002      0.007±0.002    FE,RE2,CV,CM
Chr11:92363999   rs7121092   1.32e-07  1.59e-07  9.53e-08   9.12e-08  0.00031    0.00011   0.007±0.002      0.007±0.002    FE,RE2,CV,CM
Chr12:96309986   rs7298683   8.55e-05  1.73e-07  7.63e-05   7.54e-05  0.86718    5.92e-09  -0.000±0.004     0.023±0.004    RE2

Phenotype: hdlres
Position         rsid        FE        RE2       Covariate  Combined  p(female)  p(male)   beta±SE(female)  beta±SE(male)  Best methods
Chr2:21047434    rs6728178   2.19e-07  2.68e-07  4.52e-07   4.54e-07  0.00390    1.16e-05  -0.031±0.011     -0.043±0.010   FE,RE2
Chr8:19875201    rs10096633  4.74e-07  4.62e-07  2.66e-06   2.69e-06  0.04291    6.66e-07  -0.034±0.017     -0.074±0.015   FE,RE2
Chr11:47226831   rs2167079   4.65e-08  3.96e-08  1.22e-06   1.22e-06  0.01838    9.78e-08  -0.023±0.010     -0.053±0.010   FE,RE2
Chr11:47242866   rs7120118   4.57e-08  4.09e-08  1.09e-06   1.09e-06  0.01679    1.15e-07  -0.023±0.010     -0.053±0.010   FE,RE2
Chr15:56470658   rs1532085   1.74e-12  2.93e-12  1.01e-11   9.98e-12  1.87e-06   2.25e-07  -0.047±0.010     -0.046±0.009   FE,RE2
Chr16:55550825   rs3764261   8.83e-33  2.37e-32  3.89e-32   3.86e-32  8.85e-16   3.63e-18  -0.088±0.011     -0.087±0.010   FE
Chr16:66570972   rs255049    3.22e-09  4.66e-09  1.35e-08   1.36e-08  0.00014    5.85e-06  -0.045±0.012     -0.049±0.011   FE,RE2
Chr20:42475778   rs1800961   1.50e-08  2.11e-08  1.82e-08   1.83e-08  0.00051    6.29e-06  0.079±0.023      0.104±0.023    FE,RE2,CV,CM

Phenotype: ldlres
Position         rsid        FE        RE2       Covariate  Combined  p(female)  p(male)   beta±SE(female)  beta±SE(male)  Best methods
Chr1:109620053   rs646776    7.38e-16  1.39e-15  3.92e-15   3.92e-15  8.36e-10   2.11e-07  0.166±0.027      0.161±0.031    FE,RE2
Chr1:205941798   rs4844614   1.27e-07  1.52e-07  2.52e-07   2.52e-07  0.00080    2.78e-05  -0.080±0.024     -0.113±0.027   FE,RE2,CV,CM
Chr2:21085700    rs693       5.90e-12  9.71e-12  2.83e-11   2.83e-11  8.81e-09   0.00013   -0.126±0.022     -0.099±0.026   FE,RE2
Chr19:11056030   rs11668477  1.42e-08  1.66e-08  3.89e-09   3.88e-09  0.00240    2.62e-07  0.088±0.029      0.175±0.034    CV,CM
Chr19:50087106   rs157580    3.18e-07  3.86e-07  3.24e-07   3.25e-07  9.62e-06   0.00682   0.110±0.025      0.075±0.028    FE,RE2,CV,CM

Phenotype: sysres
Position         rsid        FE        RE2       Covariate  Combined  p(female)  p(male)   beta±SE(female)  beta±SE(male)  Best methods
Chr2:55702813    rs782602    7.30e-08  8.55e-08  1.42e-07   1.42e-07  6.00e-06   0.00246   1.555±0.343      1.115±0.368    FE,RE2,CV,CM
Chr7:33208521    rs10486523  5.47e-07  6.67e-07  4.95e-07   4.94e-07  0.00027    0.00059   -2.351±0.646     -2.435±0.710   FE,RE2,CV,CM

Phenotype: tgres
Position         rsid        FE        RE2       Covariate  Combined  p(female)  p(male)   beta±SE(female)  beta±SE(male)  Best methods
Chr2:21091049    rs673548    2.10e-08  2.83e-08  6.45e-08   6.42e-08  6.11e-06   0.00097   0.054±0.012      0.052±0.016    FE,RE2
Chr2:27584444    rs1260326   1.18e-10  1.83e-10  1.85e-10   1.86e-10  1.32e-06   2.10e-05  -0.053±0.011     -0.063±0.015   FE,RE2,CV,CM
Chr8:19875201    rs10096633  5.17e-08  5.97e-08  1.95e-08   1.92e-08  0.00109    3.47e-06  0.062±0.019      0.111±0.024    CV,CM
Chr17:4465888    rs9895232   7.61e-06  3.68e-06  0.00015    0.00015   4.73e-07   0.43738   0.060±0.012      0.011±0.015    FE,RE2

Table 8.2: Comparison of 4 association mapping approaches: fixed-effect meta-analysis (FE), random-effect meta-analysis (RE2), sex-as-covariate (Covariate, CV), and the combined approach (Combined, CM). The FE, RE2, Covariate, and Combined columns give the p-value of each approach. The p(female) and p(male) columns give the sex-specific p-values, and the beta±SE columns give the sex-specific effect size estimates with their standard errors.

We compare the statistical power of the RE2, FE, and CV approaches while varying the gene-by-sex background effect and the heterogeneity between the male and female effect sizes. We first fix the background noise heritability ($h_I$) at 0.5 and vary the GxS heritability ($h_s$) from 0 to 0.25. For each chosen $h_s$, we set $h_g$ to $0.5 - h_s$. We also vary the difference in the main genetic effect size between males and females, which is shown as “Heterogeneity” in Figure 8.1. Figure 8.1 (a), (b), and (c) show the power of the proposed random-effect meta-analysis approach, the fixed-effect meta-analysis approach, and the sex-as-covariate method, respectively. In these heat maps, darker colors indicate higher power, as shown in the legends. In Figure 8.1 (d), each region of the heat map is colored according to the most powerful method in that region. The resulting heat map indicates that in most cases, except when the effect sizes in the male and female studies are equal or there is no background gene-by-sex effect, the RE2 approach is the most powerful. When there is neither a difference in effect sizes between males and females nor a gene-by-sex background effect, CV is most powerful, and when there is no difference in effect sizes but there is a gene-by-sex background effect, FE is most powerful. One possible explanation for this observation is that the presence of many gene-by-sex interactions may increase the variance of the trait when both sexes are analyzed together, compared to when they are analyzed separately. This is consistent with our observation that the meta-analysis methods perform better than the combined methods when there is a large number of gene-by-sex interactions.

[Figure 8.1: four heat-map panels, (a) RE2, (b) FE, (c) CV, and (d) the most powerful of CV, FE, and RE2. In each panel the x-axis is Heterogeneity (0.00 to 1.00) and the y-axis is GxS Heritability (0.00 to 0.25); the color scale in panels (a) to (c) is power (0.00 to 1.00).]

Figure 8.1: Power of (a) random-effect meta-analysis (RE2), (b) fixed-effect meta-analysis (FE), and (c) the sex-as-covariate method (CV) as a function of the GxS heritability and the heterogeneity between the male and female study effect sizes. (d) shows a comparison of the three methods, with the color corresponding to the method with the highest power.

8.2.3 Statistical power comparison between FE, UFE, GWAMA, WGWAMA, SST and RE2 approaches for Sex-differentiated meta-analysis

As we have shown, meta-analysis approaches have advantages in statistical power in the presence of gene-by-sex interactions compared to traditional approaches. The question remains which meta-analysis method is the best approach to use. Unfortunately, there is no definite answer, since which meta-analysis approach is most powerful depends on the level of heterogeneity resulting from the gene-by-sex interaction. We show this by analyzing the power of FE, RE2, GWAMA (a previously proposed meta-analysis approach), and an approach that analyzes males and females separately, which we refer to as SST (Sex-Specific Test) and which declares a locus associated if any sex-specific p-value is significant.

We compare the statistical power of the FE, GWAMA, SST, and RE2 approaches by both analytically estimating the power and performing simulation experiments. The statistics for the FE, GWAMA, and SST approaches have known analytical, closed-form distributions. Thus we can easily compute the statistical power using analytically derived formulas, as shown in Sections 8.4.1, 8.4.2, 8.4.3, 8.4.4, and 8.4.8.1. Because the distribution of the RE2 statistic does not have an analytical closed form, we generate simulated data to estimate the statistical power of the method.

To assess the statistical power of the four methods, we first vary the true effect sizes of the male and female studies under the assumption that the number of samples in the male and female studies is the same. Figure 8.2 (A) shows the 50% power line of the four methods (FE, SST, RE2, GWAMA). The 50% power line shows the male and female effect sizes at which the method achieves a power of exactly 50%. Note that power increases as the effect size increases. Thus the closer the 50% line is to the bottom-left corner of the graph, the more powerful the method is. As expected, when one of the effect sizes is zero, the SST approach is the most powerful. One interesting observation in this power comparison experiment is that when the effect sizes are moderately similar between the male and female studies, the FE and RE2 approaches clearly outperform the SST and GWAMA approaches. Based on the graph, we feel that RE2 strikes a good balance: it loses very little power compared to FE when the effect sizes are close to each other, while gaining a significant advantage over FE when the effect sizes are different.

We also perform a similar power comparison experiment when the number of individuals differs between the female and male studies. Specifically, we assume that there are twice as many females as males ($N_{female} = 2N_{male}$). Figure 8.2 (B) shows the 50% power line of the four methods (FE, SST, RE2, GWAMA) in the unbalanced study. The 50% power lines show very similar patterns to the balanced study case. However, when the female effect size is small and the male effect size is large, there is a region where GWAMA outperforms the other methods. The reason for this phenomenon is that GWAMA ignores the unbalanced number of individuals between males and females. A major difference between GWAMA and the other meta-analysis methods is that GWAMA always weighs each study equally, while the other methods take into account the differences in the variance of the effect size estimates, which are typically driven by differences in sample sizes. It can be easily shown using the Cauchy-Schwarz inequality that the typical inverse-variance weighting of the fixed-effect meta-analysis is optimal with respect to power. We extended GWAMA to allow for weighting each study, and we set the weights in a similar fashion to optimize power under the same assumptions as the other meta-analysis approaches. We demonstrate our new method, Weighted GWAMA (WGWAMA), by comparing it to FE, RE2, and also an unweighted version of the fixed-effect meta-analysis (UFE) to gain intuition about the effect of the weights.
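The Cauchy-Schwarz argument mentioned above can be written out as follows. This is a sketch under assumptions that the text does not spell out: the estimates $\hat\beta_i$ are independent with $\hat\beta_i \sim N(\beta, \sigma_i^2)$, and $v_i$ denotes arbitrary positive study weights (a symbol introduced here only for illustration).

```latex
Z_v = \frac{\sum_i v_i \hat\beta_i}{\sqrt{\sum_i v_i^2 \sigma_i^2}},
\qquad
E[Z_v] = \beta \, \frac{\sum_i v_i}{\sqrt{\sum_i v_i^2 \sigma_i^2}},
\qquad
\mathrm{Var}(Z_v) = 1 .

\Bigl(\sum_i v_i\Bigr)^2
  = \Bigl(\sum_i (v_i \sigma_i) \cdot \frac{1}{\sigma_i}\Bigr)^2
  \le \Bigl(\sum_i v_i^2 \sigma_i^2\Bigr) \Bigl(\sum_i \frac{1}{\sigma_i^2}\Bigr)
\quad\Longrightarrow\quad
\bigl|E[Z_v]\bigr| \le |\beta| \sqrt{\sum_i \frac{1}{\sigma_i^2}} .
```

Equality holds exactly when $v_i \sigma_i \propto 1/\sigma_i$, that is, $v_i \propto 1/\sigma_i^2$. Since $Z_v$ has unit variance for any choice of weights, the power of the two-sided test grows with $|E[Z_v]|$, so under these assumptions the inverse-variance weights give the largest power.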

Figure 8.3 (A) shows the 50% power line of five methods (FE, UFE, GWAMA, WGWAMA, SST) under the assumption that the male and female studies are balanced. As shown, unweighted FE (UFE) and weighted GWAMA (WGWAMA) show the same power as FE and GWAMA, respectively, when the male and female studies are balanced. Figure 8.3 (B) shows the 50% power line of the five methods in an unbalanced study where there are twice as many females as males. In this case, weighted GWAMA (WGWAMA) and FE are the more appropriate and natural approaches in that they give more weight to the larger study.

[Figure 8.2: two panels, (a) and (b), showing 50% power lines for FE, SST, RE2, and GWAMA. In each panel the x-axis is Male Effect Size (0 to 1) and the y-axis is Female Effect Size (0 to 1).]

Figure 8.2: Statistical power comparison between the FE, GWAMA, SST, and RE2 approaches while varying the effect sizes of the male and female studies. (a) The power comparison under the assumption of a balanced number of individuals and (b) when there are twice as many females as males. The diagonal line shows the points where the effect sizes of males and females are equal.

[Figure 8.3: two panels, (a) and (b), showing 50% power lines for FE, UFE, WGWAMA, GWAMA, and SST. In each panel the x-axis is Male Effect Size (0 to 1) and the y-axis is Female Effect Size (0 to 1).]

Figure 8.3: Statistical power comparison between the FE, UFE, WGWAMA, GWAMA, and SST approaches while varying the effect sizes of the male and female studies. (a) The power comparison under the assumption of a balanced number of individuals and (b) when there are twice as many females as males. The diagonal line shows the points where the effect sizes of males and females are equal.

8.3 Discussion

We have presented Meta-sex, a novel and powerful meta-analytic approach for the analysis of genome-wide association studies containing sex-specific effects. Meta-sex utilizes a random-effect-based meta-analysis approach, which allows for heterogeneity in the effect size between the male and female studies, leading to higher statistical power compared to the traditional approach, which treats sex as a covariate. Through simulations and analysis of the NFBC data, we have shown that our proposed approach achieves higher power than traditional approaches.

One significant challenge encountered in applying meta-analysis is the potential for between-study population structure, which in this case corresponds to genetic relationships between males and females in the original study. These relationships result in correlation between the effect size estimates, which can lead to inflation and spurious associations. The traditional sex-as-covariate approach would apply a mixed model to the combined sample that would model these relationships through a kinship matrix estimated from the genotypes. Unfortunately, if we analyze the males and females separately, we are unable to correct for this between-sex population structure. One simple way to address this issue is to compute genomic control lambda values [DR99] and check whether there is any inflation. If there is, genomic control can be applied to the combined statistics. A more sophisticated approach would be to apply the method of Sul et al. (2013), which utilizes a mixed model and the kinship matrix between individuals in each study to correct for the correlation between effect size estimates directly when performing a meta-analysis. In our experiments, we did not observe any inflation of the combined statistics.

8.4 Methods

8.4.1 Fixed effects model meta analysis for sex-differentiated analysis

In standard fixed effects model meta-analysis, we first estimate the genetic effect size in the male and female studies separately using the linear model

$$ y_f = \mu_f + \beta_f X_f + e_f \tag{8.1} $$

$$ y_m = \mu_m + \beta_m X_m + e_m \tag{8.2} $$

Given the estimates of the genetic effect sizes ($\beta_m$, $\beta_f$) and their variances ($\sigma_m^2$, $\sigma_f^2$), the fixed effects model meta-analysis assumes that the underlying effect size is the same in the male and female studies ($\beta = \beta_m = \beta_f$). The best estimate of $\beta$ is the inverse-variance weighted effect size [BFJ08],

$$ \bar{\beta} = \frac{w_m \beta_m + w_f \beta_f}{w_m + w_f}, \qquad SE(\bar{\beta}) = \frac{1}{\sqrt{w_m + w_f}} \tag{8.3} $$

$$ Z_{FE} = \frac{\bar{\beta}}{SE(\bar{\beta})} = \frac{w_m \beta_m + w_f \beta_f}{\sqrt{w_m + w_f}} \tag{8.4} $$

where $w_m = 1/\sigma_m^2$ and $w_f = 1/\sigma_f^2$ are the so-called inverse-variance weights. Then we can test the null hypothesis $\beta = 0$ versus the alternative hypothesis $\beta \neq 0$.
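Equations (8.3) and (8.4) translate directly into code. The following is a minimal sketch using only the Python standard library; the function name `fixed_effects_meta` is an assumption made here, and the inputs are the rounded sex-specific estimates for Chr11:47226831 from Table 8.2, so the resulting p-value is only expected to be of the same order as the reported FE p-value.

```python
import math


def fixed_effects_meta(beta_m, se_m, beta_f, se_f):
    """Inverse-variance weighted estimate, its standard error, z-score and two-sided p-value."""
    w_m, w_f = 1.0 / se_m**2, 1.0 / se_f**2
    beta_bar = (w_m * beta_m + w_f * beta_f) / (w_m + w_f)   # Equation (8.3)
    se_bar = 1.0 / math.sqrt(w_m + w_f)                      # Equation (8.3)
    z_fe = beta_bar / se_bar                                 # Equation (8.4)
    p = math.erfc(abs(z_fe) / math.sqrt(2.0))                # two-sided standard normal tail
    return beta_bar, se_bar, z_fe, p


# male: -0.053 +/- 0.010, female: -0.023 +/- 0.010 (rounded values from Table 8.2)
beta_bar, se_bar, z_fe, p = fixed_effects_meta(-0.053, 0.010, -0.023, 0.010)
print(beta_bar, se_bar, z_fe, p)
```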

8.4.2 Unweighted fixed effects model meta analysis for sex-differentiated analysis

Unlike the standard fixed effects model meta-analysis, the unweighted fixed effects meta-analysis assigns the same weight to both the male and the female study regardless of the variance of the effect size estimates. Given the estimates of the genetic effect sizes ($\beta_m$, $\beta_f$) and their variances ($\sigma_m^2$, $\sigma_f^2$) from Equations (8.1) and (8.2), we compute the z-score of each study by dividing the effect size estimate by its standard error,

$$ Z_m = \frac{\beta_m}{\sigma_m}, \qquad Z_f = \frac{\beta_f}{\sigma_f}. \tag{8.5} $$

Then the unweighted fixed effect z-score is obtained by weighting $Z_m$ and $Z_f$ equally,

$$ Z_{UFE} = \frac{w Z_m + w Z_f}{\sqrt{w^2 + w^2}} = \frac{Z_m + Z_f}{\sqrt{2}}. \tag{8.6} $$

The p-value of the unweighted fixed effects model meta-analysis can be easily computed, as $Z_{UFE}$ follows the standard normal distribution under the null hypothesis that $\beta_m = 0$ and $\beta_f = 0$.
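A short sketch of the unweighted statistic, assuming the z-scores are the effect size estimates divided by their standard errors; the function name `unweighted_fe` is hypothetical.

```python
import math


def unweighted_fe(beta_m, se_m, beta_f, se_f):
    """Equally weighted combination of the two study z-scores and its two-sided p-value."""
    z_m, z_f = beta_m / se_m, beta_f / se_f
    z_ufe = (z_m + z_f) / math.sqrt(2.0)
    p = math.erfc(abs(z_ufe) / math.sqrt(2.0))   # two-sided standard normal tail
    return z_ufe, p


z_ufe, p_ufe = unweighted_fe(0.02, 0.01, 0.02, 0.01)
print(z_ufe, p_ufe)
```

When both studies have the same standard error, this z-score coincides with the inverse-variance weighted one, which is consistent with the balanced-study results described for Figure 8.3.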

8.4.3 SST approach for sex-differentiated analysis

The SST approach is similar to a sex-specific association test. We first estimate the genetic effect sizes ($\beta_m$, $\beta_f$) and their standard errors ($\sigma_m$, $\sigma_f$) in the male and female studies separately using the linear model in Equations (8.1) and (8.2). Then we compute the z-score statistic of the male and female studies separately as $z_m = \beta_m/\sigma_m$ and $z_f = \beta_f/\sigma_f$. The SST statistic is the maximum of the absolute values of the z-scores, $S_{max} = \max(|z_m|, |z_f|)$. The p-value is obtained from $S_{max}$ using the standard normal distribution under the null hypothesis.
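The SST statistic can be sketched as follows. The function name `sst_p_value` is an assumption, and the code follows the text literally by referring $S_{max}$ to the standard normal distribution, so the returned p-value equals the smaller of the two sex-specific p-values and is not adjusted for taking the maximum of two tests. The inputs are the rounded estimates for Chr17:68989647 in the crpres phenotype.

```python
import math


def sst_p_value(beta_m, se_m, beta_f, se_f):
    """Maximum absolute z-score over the two studies and its two-sided normal p-value."""
    z_m, z_f = beta_m / se_m, beta_f / se_f
    s_max = max(abs(z_m), abs(z_f))
    p = math.erfc(s_max / math.sqrt(2.0))   # two-sided standard normal tail
    return s_max, p


# male: -0.006 +/- 0.049, female: -0.251 +/- 0.048 (rounded values from Table 8.2)
s_max, p_sst = sst_p_value(-0.006, 0.049, -0.251, 0.048)
print(s_max, p_sst)
```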

8.4.4 GWAMA approach for sex-differentiated analysis

In the GWAMA approach [MLM10], we first estimate the genetic effect sizes ($\beta_m$, $\beta_f$) and their standard errors ($\sigma_m$, $\sigma_f$) in the male and female studies separately using the linear model in Equations (8.1) and (8.2). Then we compute the $\chi^2$ statistic of the male and female studies separately by squaring the z-scores, $\chi_m^2 = z_m^2 = (\beta_m/\sigma_m)^2$ and $\chi_f^2 = z_f^2 = (\beta_f/\sigma_f)^2$. The GWAMA statistic is obtained by summing the male and female $\chi^2$ statistics, $S_{GWAMA} = \chi_m^2 + \chi_f^2$. The p-value is obtained from $S_{GWAMA}$ under the null hypothesis of a central $\chi^2$ distribution with 2 degrees of freedom.
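Because a central $\chi^2$ distribution with 2 degrees of freedom has the survival function $\exp(-x/2)$, the GWAMA p-value needs no statistics library. A minimal sketch, with the function name `gwama_p_value` chosen here for illustration:

```python
import math


def gwama_p_value(beta_m, se_m, beta_f, se_f):
    """Sum of the two squared z-scores and its p-value under a central chi-square with 2 df."""
    chi2_m = (beta_m / se_m) ** 2
    chi2_f = (beta_f / se_f) ** 2
    s_gwama = chi2_m + chi2_f
    p = math.exp(-s_gwama / 2.0)   # survival function of the central chi-square with 2 df
    return s_gwama, p


s_gwama, p_gwama = gwama_p_value(0.02, 0.01, 0.02, 0.01)
print(s_gwama, p_gwama)
```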

8.4.5 Weighted GWAMA approach for sex-differentiated analysis

Unlike the traditional GWAMA approach, weighted GWAMA (WGWAMA) seeks the weights that are optimal with respect to power when combining the two $\chi^2$ statistics from the male and female studies. The relationship between WGWAMA and GWAMA is analogous to that between FE and unweighted FE. Similar to FE, WGWAMA assumes a fixed effect at the causal locus in both the male and the female study. Under this assumption, WGWAMA performs a numerical grid search for the optimal weights to combine the two $\chi^2$ statistics. For each set of weights, it first generates 2 million null $\chi^2$ statistics by squaring z-scores sampled from the standard normal distribution. Subsequently, we obtain 1 million null statistics by combining pairs of null $\chi^2$ statistics with the given weights. Given the significance level $\alpha$, the ($\alpha \times$ 1 million)-th most significant null statistic is the significance threshold for the given weights. Then we generate 10000 alternative statistics. For this, we first generate 10000 pairs of alternative $\chi^2$ statistics by squaring z-scores sampled from normal distributions with means $\sqrt{N_m}$ and $\sqrt{N_f}$, where $N_m$ and $N_f$ are the numbers of samples in the male study and the female study, respectively. The power for the specific weights is then computed as the proportion of alternative statistics that are more significant than the threshold for those weights. The above power estimation is performed for a wide range of weights, and the most powerful weights are chosen for the WGWAMA approach.
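The grid search described above can be sketched as follows. Several details are assumptions made for this illustration and are not taken from the text: the weights are parameterized as $(w, 1 - w)$, the alternative z-score means are scaled by a hypothetical per-sample effect `effect`, the same random draws are reused for every weight, and the sample counts are far smaller than the 1 million null and 10000 alternative statistics used in the thesis so that the example runs quickly.

```python
import random


def wgwama_best_weight(n_m, n_f, effect, alpha=0.05, n_null=20000, n_alt=2000,
                       grid=11, seed=0):
    """Grid search for the weight on the male chi-square statistic that maximizes power."""
    rng = random.Random(seed)
    # Pairs of null chi-square statistics (squared standard normal z-scores).
    null_pairs = [(rng.gauss(0.0, 1.0) ** 2, rng.gauss(0.0, 1.0) ** 2)
                  for _ in range(n_null)]
    # Pairs of alternative chi-square statistics (squared shifted z-scores).
    alt_pairs = [(rng.gauss(effect * n_m ** 0.5, 1.0) ** 2,
                  rng.gauss(effect * n_f ** 0.5, 1.0) ** 2)
                 for _ in range(n_alt)]
    best_weight, best_power = None, -1.0
    for k in range(grid):
        w = k / (grid - 1)   # weight on the male study; the female study gets 1 - w
        null_stats = sorted((w * a + (1.0 - w) * b for a, b in null_pairs), reverse=True)
        threshold = null_stats[max(int(alpha * n_null) - 1, 0)]
        power = sum(w * a + (1.0 - w) * b > threshold for a, b in alt_pairs) / n_alt
        if power > best_power:
            best_weight, best_power = w, power
    return best_weight, best_power


weight, power = wgwama_best_weight(n_m=1000, n_f=2000, effect=0.05)
print(weight, power)
```

Reusing the same draws across weights keeps the comparison between weights from being dominated by sampling noise; regenerating them per weight, as the text describes, would also work with larger sample counts.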

8.4.6 Meta-sex approach for sex-differentiated analysis

In the Meta-sex approach, a random effects model is assumed. In other words, we assume that the effect size of the male and female studies, $\beta_i$ ($i = m, f$), is a random variable drawn from a distribution with grand mean $\beta$ and variance $\tau^2$ [HE11b, DL86a],

$$ \beta_i \sim N(\beta, \tau^2). $$

This random effects model tests the null hypothesis $\beta = 0$ and $\tau^2 = 0$ versus the alternative hypothesis $\beta \neq 0$ or $\tau^2 \neq 0$. Similarly to the traditional random effects model [DL86a], we use the likelihood ratio framework, considering each statistic (a $\beta_i$ and $\sigma_i$ pair) as a single observation. Here $\beta_i$ and $\sigma_i$ represent the effect size estimate and its standard error, respectively. However, to avoid the conservative nature of the traditional random effects model, we assume no heterogeneity under the null hypothesis [HE11b], which is natural because the effect size should be fixed at zero under the null hypothesis. Writing $\mu$ for the grand mean, the likelihoods are then

$$ L_0 = \prod_{i=m,f} \frac{1}{\sqrt{2\pi\sigma_i^2}} \exp\left(-\frac{\beta_i^2}{2\sigma_i^2}\right) $$

$$ L_1 = \prod_{i=m,f} \frac{1}{\sqrt{2\pi(\sigma_i^2 + \tau^2)}} \exp\left(-\frac{(\beta_i - \mu)^2}{2(\sigma_i^2 + \tau^2)}\right). $$

The maximum likelihood estimates $\hat{\mu}$ and $\hat{\tau}^2$ can be found by an iterative procedure suggested by Hardy and Thompson [HT96]. Then the likelihood ratio test statistic can be built as

$$ S_{meta} = -2\log(\lambda) = \sum \log\left(\frac{\sigma_i^2}{\sigma_i^2 + \hat{\tau}^2}\right) + \sum \frac{\beta_i^2}{\sigma_i^2} - \sum \frac{(\beta_i - \hat{\mu})^2}{\sigma_i^2 + \hat{\tau}^2}. \tag{8.7} $$

The p-value can be calculated efficiently using pre-computed tabulated values [HE11b].
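The statistic in Equation (8.7) can be illustrated with a small sketch. Note the substitution: instead of the Hardy and Thompson iteration cited in the text, this example uses a crude grid search over $\tau^2$ and, for each candidate, the weighted mean that maximizes the likelihood in $\mu$. The function name `meta_sex_statistic`, the grid size, and the grid upper bound are assumptions made here, and the tabulated p-value lookup is not reproduced.

```python
import math


def meta_sex_statistic(betas, ses, grid=2000):
    """Return (S_meta, mu_hat, tau2_hat) of Equation (8.7) using a grid search over tau^2."""
    variances = [s * s for s in ses]

    def statistic(tau2):
        weights = [1.0 / (v + tau2) for v in variances]
        # For fixed tau^2 the likelihood is maximized by the weighted mean.
        mu = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
        value = sum(math.log(v / (v + tau2)) for v in variances)
        value += sum(b * b / v for b, v in zip(betas, variances))
        value -= sum((b - mu) ** 2 / (v + tau2) for b, v in zip(betas, variances))
        return value, mu, tau2

    # Assumption: tau^2 is searched on [0, (largest difference between estimates)^2].
    tau2_max = (max(betas) - min(betas)) ** 2
    candidates = (statistic(tau2_max * k / grid) for k in range(grid + 1))
    return max(candidates, key=lambda c: c[0])


# Equal effects: the estimate of tau^2 is zero and the statistic equals the squared FE z-score.
print(meta_sex_statistic([0.02, 0.02], [0.01, 0.01]))
# Heterogeneous effects (rounded crpres values for Chr17:68989647 from Table 8.2).
print(meta_sex_statistic([-0.006, -0.251], [0.049, 0.048]))
```

Maximizing the statistic over the grid is equivalent to maximizing $L_1$, because $L_0$ does not depend on $\mu$ or $\tau^2$.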

8.4.7 Controlling for population structure within studies

Genetic similarity between individuals, called population structure or cryptic relatedness, is well known both for inhibiting the ability to find true associations and for causing the appearance of a large number of false or spurious associations [DRB01, VP05]. One popular method to correct for this problem is to employ mixed models [Lan02, YPB06, KZW08, KSS10, LQK13, LLK13]. These mixed model approaches introduce a random variable into the traditional linear model.

$$ Y_i = \mu + \beta_i X_i + u_i + \epsilon \quad (i = m, f) \tag{8.8} $$

In the model in Equation (8.8), the random variable $u_i$ represents the vector of genetic contributions to the phenotype for individuals in study $i$. This random variable is assumed to follow a normal distribution, $u_i \sim N(0, \sigma_g^2 K_i)$, where $K_i$ is the $n_i \times n_i$ kinship coefficient matrix for study $i$. With this assumption, the total variance of $Y_i$ is given by $\Sigma_i = \sigma_g^2 K_i + \sigma_e^2 I$. A z-score statistic is derived for the test $\beta_i = 0$ by noting the distribution of the estimate $\hat{\beta}_i$. In order to avoid complicated notation, we introduce a more basic matrix form of the model in Equation (8.8), shown in Equation (8.9).

$$ Y_i = S_i \Gamma + u_i + \epsilon \tag{8.9} $$

In Equation (8.9), $S_i$ is an $n_i \times 2$ matrix encoding the global mean and SNP vectors and $\Gamma$ is a $2 \times 1$ coefficient vector. The maximum likelihood estimate for $\Gamma$ in study $i$ is given by $\hat{\Gamma}_i = (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} Y_i$, which follows a normal distribution with mean equal to the true $\Gamma$ and variance $(S_i' \Sigma_i^{-1} S_i)^{-1}$. The estimates of the effect size $\beta_i$ and its standard error $\sigma_i$ are then given in Equations (8.10) and (8.11), where $R = [0\ 1]$ is a vector used to select the appropriate entry of $\hat{\Gamma}_i$.

$$ \beta_i = R (S_i' \Sigma_i^{-1} S_i)^{-1} S_i' \Sigma_i^{-1} Y_i \tag{8.10} $$

$$ \sigma_i = [R (S_i' \Sigma_i^{-1} S_i)^{-1} R']^{1/2} \tag{8.11} $$

The above mixed-model-based approach can be used in conjunction with any of the above sex-differentiated methods, such as fixed effects meta-analysis, the SST approach, the GWAMA approach, and the Meta-Sex approach, to control for population structure in each male and female study.
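Equations (8.10) and (8.11) are a generalized least squares estimate. The sketch below computes them in plain Python for a small example, assuming the total covariance $\Sigma_i$ is already known (in practice the variance components would have to be estimated first, which is not shown). The helper names `solve` and `gls_effect` are hypothetical, and a linear algebra library would normally replace the hand-written solver.

```python
import math


def solve(a, b):
    """Solve a x = b by Gauss-Jordan elimination with partial pivoting; b may have several columns."""
    n = len(a)
    aug = [list(a[r]) + list(b[r]) for r in range(n)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(aug[r][col]))
        aug[col], aug[pivot] = aug[pivot], aug[col]
        p = aug[col][col]
        aug[col] = [v / p for v in aug[col]]
        for r in range(n):
            if r != col and aug[r][col] != 0.0:
                f = aug[r][col]
                aug[r] = [vr - f * vc for vr, vc in zip(aug[r], aug[col])]
    return [row[n:] for row in aug]


def gls_effect(y, x, sigma):
    """SNP effect size and standard error as in Equations (8.10) and (8.11)."""
    n = len(y)
    s = [[1.0, float(x[k])] for k in range(n)]        # S_i: global mean and SNP columns
    rhs = [s[k] + [float(y[k])] for k in range(n)]    # columns of S_i followed by Y_i
    sol = solve(sigma, rhs)                           # Sigma^{-1} [S_i, Y_i]
    # m = S' Sigma^{-1} S (2 x 2) and v = S' Sigma^{-1} Y (length 2)
    m = [[sum(s[k][a] * sol[k][b] for k in range(n)) for b in range(2)] for a in range(2)]
    v = [sum(s[k][a] * sol[k][2] for k in range(n)) for a in range(2)]
    det = m[0][0] * m[1][1] - m[0][1] * m[1][0]
    inv = [[m[1][1] / det, -m[0][1] / det], [-m[1][0] / det, m[0][0] / det]]
    beta = inv[1][0] * v[0] + inv[1][1] * v[1]        # R = [0 1] selects the SNP coefficient
    se = math.sqrt(inv[1][1])
    return beta, se


# Toy data: y = 1 + 2 x exactly, with an identity covariance matrix.
identity = [[1.0 if i == j else 0.0 for j in range(5)] for i in range(5)]
beta_hat, se_hat = gls_effect([1, 3, 5, 1, 3], [0, 1, 2, 0, 1], identity)
print(beta_hat, se_hat)
```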

8.4.8 Computation of statistical power of the methods

8.4.8.1 Analytical power computation for FE, SST and GWAMA methods

Let $\alpha$ be the p-value threshold, which controls the false positive rate of sex-differentiated association testing. For the analytical power computation of the fixed effects meta-analysis approach, let $\Phi(x; \mu, \sigma)$ be the CDF of the normal distribution with mean $\mu$ and standard deviation $\sigma$, evaluated at statistic $x$. Since the FE test is a two-sided test, the threshold statistic is obtained from the inverse CDF at $\alpha/2$, $T_\alpha = \Phi^{-1}(\alpha/2; 0, 1)$. Given the effect sizes ($\beta_m$, $\beta_f$) and their standard errors ($\sigma_m$, $\sigma_f$) from the male and female studies, the non-centrality parameter is easily obtained from Equation (8.4), $S_{FE} = (w_f \beta_f + w_m \beta_m)/\sqrt{w_f + w_m}$. Given $T_\alpha$ and $S_{FE}$, we derive the formula for the analytical power of the fixed effects model meta-analysis.

$$ Power_{FE} = \Phi(T_\alpha; S_{FE}, 1) + 1 - \Phi(-T_\alpha; S_{FE}, 1) \tag{8.12} $$
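Equation (8.12) can be evaluated with the standard library's `statistics.NormalDist`. In this sketch the function name `fe_power` and the default threshold (the $5.0 \times 10^{-7}$ value mentioned in Section 8.4.8.2) are choices made for illustration.

```python
import math
from statistics import NormalDist


def fe_power(beta_m, se_m, beta_f, se_f, alpha=5e-7):
    """Analytical power of the two-sided fixed effects test, Equation (8.12)."""
    w_m, w_f = 1.0 / se_m**2, 1.0 / se_f**2
    s_fe = (w_f * beta_f + w_m * beta_m) / math.sqrt(w_f + w_m)   # non-centrality parameter
    t_alpha = NormalDist().inv_cdf(alpha / 2.0)                    # lower-tail (negative) threshold
    shifted = NormalDist(mu=s_fe, sigma=1.0)
    return shifted.cdf(t_alpha) + 1.0 - shifted.cdf(-t_alpha)


print(fe_power(0.0, 1.0, 0.0, 1.0, alpha=0.05))   # no effect: power equals alpha
print(fe_power(4.0, 1.0, 4.0, 1.0))
```

With no true effect the power reduces to the false positive rate $\alpha$, which is a quick sanity check of the formula.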

The statistical power of the SST approach is one minus the probability that the hypothesis tests for both the male and the female study are not rejected, given $z_m$, $z_f$, and $T_\alpha$. This quantity can be computed analytically by the following equation.

$$ Power_{Max} = 1 - \bigl[1 - \Phi(T_\alpha; z_m, 1) - (1 - \Phi(-T_\alpha; z_m, 1))\bigr] \times \bigl[1 - \Phi(T_\alpha; z_f, 1) - (1 - \Phi(-T_\alpha; z_f, 1))\bigr] $$
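The SST power formula multiplies the two per-study probabilities of not rejecting and subtracts the product from one. A minimal sketch, with the function name `sst_power` and the default threshold chosen here for illustration:

```python
from statistics import NormalDist


def sst_power(z_m, z_f, alpha=5e-7):
    """Probability that at least one of the two sex-specific two-sided tests rejects."""
    t_alpha = NormalDist().inv_cdf(alpha / 2.0)   # lower-tail (negative) threshold

    def not_rejected(z):
        d = NormalDist(mu=z, sigma=1.0)
        return 1.0 - d.cdf(t_alpha) - (1.0 - d.cdf(-t_alpha))

    return 1.0 - not_rejected(z_m) * not_rejected(z_f)


print(sst_power(0.0, 0.0, alpha=0.05))   # no effect in either study
print(sst_power(6.0, 0.0))
```

With no effect in either study the result is $1 - (1 - \alpha)^2$, which reflects that each sex-specific test is carried out at level $\alpha$.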

To derive the statistical power formula for the GWAMA approach, we first define $F_{\chi^2}(x; m, k)$ as the CDF of the $\chi^2$ distribution with non-centrality parameter $m$ and $k$ degrees of freedom. Since the GWAMA statistic follows a central $\chi^2$ distribution with 2 degrees of freedom under the null hypothesis, given the p-value threshold $\alpha$ for controlling the false positive rate, the threshold statistic for the GWAMA approach can be expressed as $T_\alpha = F_{\chi^2}^{-1}(1 - \alpha; 0, 2)$. The non-centrality parameter for the GWAMA approach can be computed as $S_{GWAMA} = z_m^2 + z_f^2$, as explained in Section 8.4.4. Then the statistical power of the GWAMA approach, which is the probability that the statistic exceeds the threshold, can be computed analytically with the following formula.

$$ Power_{GWAMA} = 1 - F_{\chi^2}(T_\alpha; S_{GWAMA}, 2) \tag{8.13} $$
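The GWAMA power can be computed without a statistics library by writing the non-central $\chi^2$ distribution as a Poisson mixture of central $\chi^2$ distributions, whose tails have a closed form when the degrees of freedom are even. This is a sketch of that standard representation; the function names, the default threshold, and the number of series terms are choices made for illustration.

```python
import math


def ncx2_sf_2df(x, noncentrality, terms=500):
    """P(X > x) for a non-central chi-square with 2 degrees of freedom (Poisson-mixture series)."""
    half_x, half_nc = x / 2.0, noncentrality / 2.0
    poisson = math.exp(-half_nc)   # Poisson(half_nc) probability of j = 0
    term, partial = 1.0, 1.0       # (x/2)^j / j! and its running sum over 0..j
    total = 0.0
    for j in range(terms):
        # Central chi-square with 2(j+1) df: P(X > x) = exp(-x/2) * sum_{i<=j} (x/2)^i / i!
        total += poisson * math.exp(-half_x) * partial
        poisson *= half_nc / (j + 1)
        term *= half_x / (j + 1)
        partial += term
    return total


def gwama_power(z_m, z_f, alpha=5e-7):
    """Probability that the GWAMA statistic exceeds its threshold."""
    t_alpha = -2.0 * math.log(alpha)   # (1 - alpha) quantile of the central chi-square, 2 df
    return ncx2_sf_2df(t_alpha, z_m**2 + z_f**2)


print(gwama_power(0.0, 0.0, alpha=0.05))   # no effect: power equals alpha
print(gwama_power(4.0, 4.0))
```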

8.4.8.2 Empirical power estimation for Meta-Sex methods

Unlike the FE, GWAMA, and SST methods, the analytical distribution of the Meta-Sex statistic is unknown. Thus, to estimate the statistical power of the Meta-Sex approach, we employ a simulation approach. Given the true effect sizes and standard errors of the male and female studies, we sample $N$ male and female effect sizes. $N$ p-values are then computed as explained in Section 8.4.6. The statistical power is computed as the proportion of p-values that are more significant than a predefined genome-wide significance threshold such as $5.0 \times 10^{-7}$.
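The simulation procedure can be sketched generically as follows. Because the tabulated Meta-Sex p-values are not reproduced here, the example plugs in a fixed effects p-value as a stand-in; the function names and the p-value callback interface are assumptions made for illustration.

```python
import math
import random


def empirical_power(p_value_fn, beta_m, se_m, beta_f, se_f,
                    n=1000, threshold=5e-7, seed=0):
    """Proportion of simulated effect size pairs whose p-value is below the threshold."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n):
        b_m = rng.gauss(beta_m, se_m)   # simulated male effect size estimate
        b_f = rng.gauss(beta_f, se_f)   # simulated female effect size estimate
        if p_value_fn(b_m, se_m, b_f, se_f) < threshold:
            hits += 1
    return hits / n


def fe_p_value(b_m, se_m, b_f, se_f):
    """Stand-in p-value function (fixed effects); not the Meta-Sex p-value."""
    w_m, w_f = 1.0 / se_m**2, 1.0 / se_f**2
    z = (w_m * b_m + w_f * b_f) / math.sqrt(w_m + w_f)
    return math.erfc(abs(z) / math.sqrt(2.0))


print(empirical_power(fe_p_value, 4.0, 1.0, 4.0, 1.0))
```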

8.4.9 Simulating GxS interactions

In order to simulate GxS interactions, we sample from the appropriate multivariate normal distribution. We are given the kinship matrices $K_m$, $K_f$, and $K_a$ for males, females, and all individuals, respectively. The sex-specific kinship matrix $K_s$ is then given by the following.

$$ K_s = \begin{pmatrix} K_m & 0 \\ 0 & K_f \end{pmatrix} \tag{8.14} $$

Given a heritability $h$ and a GxS heritability $h_s$, a simulated phenotype is generated by sampling from a multivariate normal distribution with mean zero and covariance $hK_a + h_sK_s + (1 - h - h_s)I$. The intuition is that shared genetics between individuals accounts on average for some proportion of the overall phenotypic variance ($h$), while the proportion of variance accounted for by genetics for only males or only females is $h + h_s$. That is, same-sex individuals have a similar genetic contribution to their overall trait value, and this contribution is different when comparing across sexes.
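The sampling step can be sketched in plain Python with a hand-written Cholesky factorization. The toy kinship values below are made up for illustration, and the code assumes that individuals in $K_a$ are ordered with males first so that the blocks of $K_s$ line up; with real data a linear algebra library would be the natural choice.

```python
import math
import random


def block_diagonal(k_m, k_f):
    """Sex-specific kinship matrix K_s of Equation (8.14), males first."""
    n_m, n_f = len(k_m), len(k_f)
    top = [list(row) + [0.0] * n_f for row in k_m]
    bottom = [[0.0] * n_m + list(row) for row in k_f]
    return top + bottom


def cholesky(a):
    """Lower-triangular factor l with l l^T = a, for a symmetric positive definite a."""
    n = len(a)
    l = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = a[i][j] - sum(l[i][k] * l[j][k] for k in range(j))
            l[i][j] = math.sqrt(s) if i == j else s / l[j][j]
    return l


def simulate_phenotype(k_a, k_s, h, h_s, rng):
    """Draw one phenotype vector from N(0, h K_a + h_s K_s + (1 - h - h_s) I)."""
    n = len(k_a)
    cov = [[h * k_a[i][j] + h_s * k_s[i][j] + ((1.0 - h - h_s) if i == j else 0.0)
            for j in range(n)] for i in range(n)]
    l = cholesky(cov)
    z = [rng.gauss(0.0, 1.0) for _ in range(n)]
    return [sum(l[i][k] * z[k] for k in range(i + 1)) for i in range(n)]


# Toy example with two males followed by two females (made-up kinship values).
k_m = [[1.0, 0.5], [0.5, 1.0]]
k_f = [[1.0, 0.25], [0.25, 1.0]]
k_a = [[1.0, 0.5, 0.1, 0.1],
       [0.5, 1.0, 0.1, 0.1],
       [0.1, 0.1, 1.0, 0.25],
       [0.1, 0.1, 0.25, 1.0]]
k_s = block_diagonal(k_m, k_f)
phenotype = simulate_phenotype(k_a, k_s, h=0.25, h_s=0.25, rng=random.Random(0))
print(phenotype)
```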

8.4.10 Computing heritability of GxS interactions

We assume an additive model for the genetic and GxS interaction effects and estimate the phenotypic variation due to these effects using a linear mixed model. Let $N$ be the number of individuals and $M$ be the number of genotypes collected in the study.

$$ y = X\beta + g + g_s + e, \tag{8.15} $$

where $y$ is the $N \times 1$ vector of the observed phenotypic values of the individuals, $X$ is the $N \times K$ matrix of covariates, including the intercept and the sex, and $\beta$ is the $K \times 1$ vector of coefficients for the covariates. We treat $g$, $g_s$, and $e$ as random effects. Let $x$ be the $N \times M$ SNP matrix and let $x_i$ denote the $N \times 1$ normalized genotype vector of SNP $i$, whose components are encoded as $\{(2 - 2p_i)/\sqrt{2p_i(1 - p_i)},\ (1 - 2p_i)/\sqrt{2p_i(1 - p_i)},\ -2p_i/\sqrt{2p_i(1 - p_i)}\}$ for minor allele homozygous, heterozygous, and major allele homozygous genotypes, respectively ($p_i$ is the observed minor allele frequency). Then we assume $g \sim N(0, \sigma_g^2 K_g)$, where $K_g = xx^T$. The second random effect $g_s$ denotes the background gene-by-sex interaction effect, and we assume $g_s \sim N(0, \sigma_{gs}^2 K_{gs})$, in which $K_{gs}$ represents the covariance structure for GxS. For pairs of individuals of the same sex, the covariance is the corresponding covariance in $K_g$, but for pairs of individuals of different sexes, the covariance is 0. Finally, $e$ denotes the $N \times 1$ vector of environmental effects, and we assume $e \sim N(0, \sigma_e^2 I)$, in which $I$ is the identity matrix. After estimating $\sigma_g^2$, $\sigma_{gs}^2$, and $\sigma_e^2$ by restricted maximum likelihood using the GCTA software [YBM10, LWG11], the gene-by-sex (GxS) heritability can be computed as

$$ \hat{h}_{gs} = \frac{\hat{\sigma}_{gs}^2}{\hat{\sigma}_g^2 + \hat{\sigma}_{gs}^2 + \hat{\sigma}_e^2}. $$
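The pieces of this computation that do not require the REML fit can be sketched as follows: the genotype normalization, the construction of $K_{gs}$ from $K_g$, and the final ratio. The function names and the toy inputs are assumptions made for illustration; the variance components themselves would come from a REML fit such as the one the text attributes to GCTA, which is not reproduced here.

```python
import math


def normalize_genotypes(minor_allele_counts):
    """Normalize 0/1/2 minor allele counts of one SNP by the observed allele frequency."""
    n = len(minor_allele_counts)
    p = sum(minor_allele_counts) / (2.0 * n)   # observed minor allele frequency
    sd = math.sqrt(2.0 * p * (1.0 - p))        # zero for a monomorphic SNP, which must be excluded
    return [(g - 2.0 * p) / sd for g in minor_allele_counts]


def gxs_kinship(k_g, sex):
    """Keep the K_g entry for same-sex pairs and set the entry to 0 for different-sex pairs."""
    n = len(k_g)
    return [[k_g[i][j] if sex[i] == sex[j] else 0.0 for j in range(n)] for i in range(n)]


def gxs_heritability(var_g, var_gs, var_e):
    """Proportion of phenotypic variance attributed to the GxS component."""
    return var_gs / (var_g + var_gs + var_e)


print(normalize_genotypes([2, 1, 0, 1]))
print(gxs_kinship([[1.0, 0.2], [0.2, 1.0]], ["M", "F"]))
print(gxs_heritability(0.4, 0.1, 0.5))
```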

8.5 Supplementary Materials

QQplots for NFBC data

Figure S1: QQplots for BMI phenotype

Figure S2: QQplots for Crp phenotype

Figure S3: QQplots for Dia phenotype

Figure S4: QQplots for Glu phenotype

Figure S5: QQplots for Hdl phenotype

Figure S6: QQplots for height phenotype

Figure S7: QQplots for ins phenotype

Figure S8: QQplots for ldl phenotype

Figure S9: QQplots for sys phenotype

Figure S10: QQplots for tg phenotype

CHAPTER 9

Conclusion

In this thesis, I presented several computational genetic approaches for analyzing genetic data, including data from humans as well as model organisms such as mice. In Chapter 2, I introduced a method to infer biological networks from high-throughput data including both genetic variation and gene expression profiles. The key insight employed in the proposed method is the prior knowledge that genetic variation affects gene expression, but gene expression cannot affect genetic variation. This knowledge allows us to fix the direction of the causal relationship between genetic variation and gene expression, which in turn allows us to determine the causal direction between two gene expression levels, using the basic properties of causality theory. Moreover, we can also rule out certain directed causal relationships between gene expression levels using the same principles, thus identifying the absence of certain causal relationships. The proposed method combines a principled representation of causality using graphical causal models with small-sample statistical methods to infer the presence and absence of causal relationships between yeast genes. To overcome the problem of limited sample size, we performed a likelihood ratio test instead of performing a conditional independence test directly. The advantages of our approach are the ability to distinguish between direct and indirect effects of variation and the reduction of false positives by inferring the absence of causal relationships. These results provide a rich description of the yeast gene regulation network beyond previous results from mapping studies and co-expression network analysis. An interesting extension of this approach is to develop an advanced method that can be applied to mouse model organism data. This extension requires incorporating random effects in the model to correct for the batch effects and population structure that exist in mouse data.

Next I presented an efficient dynamic programming algorithm to compute the marginalized posterior probability of causal graphical features, such as adjacencies and v-structures. The goal of inferring causal relationships from observational data is to identify the exact data generating model. However inferring causal graphical features has a fundamental limitation if we only use observational data. That is, the best we can infer with observational data is the Markov equivalence class of the data generating model. Previous approaches to this problem have computed the marginalized posterior probability of directed edges or adjacencies between the variables of interest. How- ever, the posterior probability of directed edges in the graphical model is dependent on the shape of the true data generating model, which consequently decides the Markov equivalence class. As proven in Chapter 3, the scale of the posterior probability of directed edge features will not be the same across all directed graphical edge features. One problem of this phenomenon is that it becomes difficult to decide on a single threshold for posterior probability to claim the existence of the directed edge features. The solution we proposed is computing the posterior probability of graphical features defining the Markov equivalence class, such as adjacencies and v-structures, instead of considering directed edges. Based on this theoretical justification, we devised the dynamic programming algorithm to efficiently compute the posterior probability of the existence of adjacencies and v-structures.

I also introduced a meta-analysis method to combine multiple genome-wide association studies, where each study contains population structure. Meta-analysis is a popular tool for combining multiple studies to achieve higher statistical power. In addition, studies can be combined at the summary statistic level, which makes it easier to share data when data sharing is restricted. The method of combining multiple studies at the summary statistic level for optimal statistical power is well known when the studies do not contain population structure. In Chapter 4, I proposed a solution to the problem of performing meta-analysis when each study contains a different population structure. Population structure is well known to cause spurious association signals and false positives in genome-wide association studies, especially those involving model organisms. We showed the utility of the proposed approach in two meta-analyses of mouse studies (HDL and BMD). An interesting observation from these applications is that the meta-analysis results showed increased resolution as well as increased statistical power compared to any single study.
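Combining studies at the summary statistic level can be sketched with the standard inverse-variance weighted fixed-effects estimator. This is a minimal sketch of the well-known setting without population structure; it assumes that each study has already produced an effect size estimate and a standard error (in a structured population these would come from whatever per-study correction is applied upstream). The numbers and the function name are hypothetical.

```python
import math


def fixed_effects_meta(betas, std_errs):
    """Inverse-variance weighted fixed-effects meta-analysis.

    Returns (combined_beta, combined_se, z, two_sided_p).
    """
    weights = [1.0 / se ** 2 for se in std_errs]
    beta = sum(w * b for w, b in zip(weights, betas)) / sum(weights)
    se = math.sqrt(1.0 / sum(weights))
    z = beta / se
    p = math.erfc(abs(z) / math.sqrt(2.0))  # two-sided normal p-value
    return beta, se, z, p


# Hypothetical per-study summary statistics for one SNP.
betas = [0.30, 0.25, 0.35]
std_errs = [0.15, 0.15, 0.15]

beta, se, z, p = fixed_effects_meta(betas, std_errs)
print("combined beta=%.3f se=%.4f z=%.2f p=%.2g" % (beta, se, z, p))
```

With three equally precise studies, the combined standard error shrinks by a factor of the square root of three, so a signal that is only modest in each study (z between about 1.7 and 2.3 here) becomes clearly stronger in the combined analysis (z of about 3.5).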

Identity by descent (IBD) is the foundation for many important problems in genetics, including haplotype phasing, understanding familial diseases, detecting population structure, and heritability estimation. Two individuals share a haplotype segment identical by descent when the segment is inherited from a recent common ancestor without recombination. Segmental IBD sharing therefore indicates that the common ancestor is relatively recent. If the IBD segment contains a disease-causing mutation, then the individuals who share this segment are likely to share the disease as well. IBD mapping is a method that searches for IBD segments containing a disease-causing mutation. The major bottleneck of the traditional approach is that it relies on permutation to estimate the significance of an association, and passing the stringent genome-wide threshold ($1.0 \times 10^{-7}$) takes years of computation. In Chapter 5, I introduced an importance sampling approach that estimates the significance efficiently, reducing computation time from years to days. Another important benefit of the proposed method is that it enables fine mapping, which helps to pinpoint the candidate causal gene. We demonstrated the applicability of our approach to the fine mapping problem with the WTCCC type 1 diabetes data.
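The reason importance sampling helps with extremely small p-values can be illustrated on a toy problem: estimating the upper tail probability of a standard normal statistic. This sketch is a generic illustration under that assumption, not the Fast-Pairwise sampler itself; the proposal distribution, the threshold, and the function names are chosen only for the example.

```python
import math
import random


def tail_prob_importance_sampling(threshold, n_samples, seed=0):
    """Estimate P(Z > threshold) for Z ~ N(0, 1) by sampling from N(threshold, 1)."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_samples):
        x = rng.gauss(threshold, 1.0)  # proposal centered on the rare region
        if x > threshold:
            # Importance weight: density of N(0, 1) over density of N(threshold, 1).
            total += math.exp(-threshold * x + 0.5 * threshold ** 2)
    return total / n_samples


def tail_prob_naive(threshold, n_samples, seed=0):
    """Estimate the same probability by sampling the null distribution directly."""
    rng = random.Random(seed)
    hits = sum(1 for _ in range(n_samples) if rng.gauss(0.0, 1.0) > threshold)
    return hits / n_samples


threshold = 5.0
exact = 0.5 * math.erfc(threshold / math.sqrt(2.0))  # about 2.9e-7
is_estimate = tail_prob_importance_sampling(threshold, 100_000)
naive_estimate = tail_prob_naive(threshold, 100_000)

print("exact               %.3g" % exact)
print("importance sampling %.3g" % is_estimate)
print("naive sampling      %.3g" % naive_estimate)
```

With 100,000 draws, direct sampling expects only about 0.03 exceedances, so it will usually report zero, much as a permutation test cannot resolve a p-value smaller than one over the number of permutations. The reweighted sampler places roughly half of its draws in the tail and recovers the probability to within a few percent with the same budget.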

Identifying environmentally specific genetic effects is a key challenge in understanding the structure of complex traits. The traditional approach to identifying gene-by-environment interactions uses a linear model that includes the interaction effect as a covariate. In this linear model, a Wald test is performed to determine whether the coefficient of the gene-by-environment interaction variable is non-zero. Discovering gene-by-environment interactions within this traditional framework poses several challenges. First, identifying gene-by-environment interactions requires many more samples than a simple GWAS to achieve similar statistical power, because the environmental variables add degrees of freedom to the model. Second, the traditional approach requires prior knowledge of the environmental variables. Specifically, before setting up the linear model, we must know which variables (e.g., sex, age, gene knockout) to include in the model and how to encode them (for example, as binary or continuous values). In many cases, it is difficult to predict which gene-by-environment interactions to expect, so the environments that should be incorporated in the model can be difficult to determine. In addition, the encoding of the environmental variables affects the statistical power of the study. In Chapter 6, I proposed a meta-analytic approach to identify gene-by-environment interactions, which addresses these challenges. The proposed approach uses the random-effects model of meta-analysis to combine studies with varying environmental conditions.
The major advantage of the proposed approach over the traditional approach is that it does not require explicit modeling of the environmental variables, since studies with varying environments are combined by meta-analysis. In addition, the meta-analytic approach makes it easy to combine multiple studies at the summary statistic level, which allows us to perform large-scale meta-analysis and achieve higher statistical power. As demonstrated in the mouse HDL study, the method is also exploratory, meaning that it can discover unexpected gene-by-environment interactions. Our proposed visualization framework allows the researcher to gain biological insight into gene-by-environment interactions.
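The intuition that varying effect sizes across environments signal a gene-by-environment interaction can be sketched with two standard heterogeneity quantities, Cochran's Q and the DerSimonian-Laird between-study variance estimate. These are textbook building blocks of random-effects meta-analysis and are shown here only as an illustration; they are not necessarily the statistic used in Chapter 6. The effect sizes below are hypothetical.

```python
def heterogeneity(betas, std_errs):
    """Cochran's Q and the DerSimonian-Laird between-study variance (tau^2).

    Returns (fixed_effects_beta, q, tau2).
    """
    w = [1.0 / se ** 2 for se in std_errs]
    beta_fixed = sum(wi * bi for wi, bi in zip(w, betas)) / sum(w)
    # Q: weighted squared deviations of study effects from the pooled effect.
    q = sum(wi * (bi - beta_fixed) ** 2 for wi, bi in zip(w, betas))
    k = len(betas)
    c = sum(w) - sum(wi ** 2 for wi in w) / sum(w)
    tau2 = max(0.0, (q - (k - 1)) / c)
    return beta_fixed, q, tau2


std_errs = [0.1, 0.1, 0.1]

# Same effect in every environment: no sign of interaction.
_, q_hom, tau2_hom = heterogeneity([0.3, 0.3, 0.3], std_errs)

# Effect changes sign across environments: strong heterogeneity.
beta_het, q_het, tau2_het = heterogeneity([0.5, 0.0, -0.5], std_errs)

print("homogeneous:   Q=%.2f tau2=%.3f" % (q_hom, tau2_hom))
print("heterogeneous: Q=%.2f tau2=%.3f pooled beta=%.3f" % (q_het, tau2_het, beta_het))
```

In the second case the pooled effect is zero, so a fixed-effects analysis would see nothing, while the heterogeneity across studies is large. No environmental covariate had to be specified or encoded to expose this pattern, which is the practical appeal of the meta-analytic view.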

In Chapter 7, I analyzed age-related hearing loss data from mice using a meta-analytic approach. To date, only a limited number of large-scale association studies of age-related hearing impairment (ARHI) have been undertaken in humans. In human studies, it is difficult to control the environmental factors affecting ARHI, which can introduce noise into a genome-wide association study. An alternative and complementary approach is the use of mouse models. Advantages of mouse models include more control over the environment, the ability to replicate measurements in genetically identical animals, and an increase in the proportion of variability explained by genetic variation. Motivated by these advantages, the study presented in Chapter 7 is the first genome-wide association study of its kind, combining several data sets in a meta-analysis to identify loci associated with age-related hearing loss in mice. We identified 5 genome-wide significant loci, one of which has a peak located in a previously implicated region. This application suggests that combining multiple independent studies can achieve high resolution and power in genome-wide association studies.

Lastly, I introduced a powerful random-effects meta-analytic approach for genome-wide association studies with sex-specific effects. Many genome-wide association studies in humans and model organisms show sex-specific effects. The prevalence of sex-specific effects complicates the analysis of association studies that include both males and females. The question I addressed in Chapter 8 was how to achieve optimal power in such studies. I compared a range of meta-analytic approaches and traditional covariate approaches to this problem. Our extensive simulation experiments show the behavior of the different approaches in various scenarios, including heterogeneity between male and female effect sizes, different sample sizes for the male and female studies, and the presence of background gene-by-sex interactions. We proposed a random-effects meta-analysis that analyzes the male and female studies separately and combines their summary statistics. We applied our method to the Northern Finland Birth Cohort data and demonstrated increased power over the traditional approach while controlling for false positives.
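Why a sex-stratified analysis can gain power when an effect is present in only one sex can be sketched with a simple comparison. The example below contrasts a pooled fixed-effects z-score with a 2 degree-of-freedom test on the sex-specific z-scores. The 2-df test is swapped in here purely as an easily verified stand-in for a heterogeneity-aware statistic; it is not the random-effects method proposed in Chapter 8, and the effect sizes are hypothetical.

```python
import math


def fixed_effects_z(betas, std_errs):
    """Inverse-variance weighted z-score across strata."""
    w = [1.0 / se ** 2 for se in std_errs]
    return sum(wi * bi for wi, bi in zip(w, betas)) / math.sqrt(sum(w))


def two_df_p(z_male, z_female):
    """P-value of z_male^2 + z_female^2 under chi-square with 2 df.

    The chi-square survival function with 2 df is exp(-x / 2).
    """
    return math.exp(-(z_male ** 2 + z_female ** 2) / 2.0)


# Hypothetical male-specific effect: present in males, absent in females.
beta_male, se_male = 0.4, 0.1
beta_female, se_female = 0.0, 0.1

z_fixed = fixed_effects_z([beta_male, beta_female], [se_male, se_female])
p_fixed = math.erfc(abs(z_fixed) / math.sqrt(2.0))
p_2df = two_df_p(beta_male / se_male, beta_female / se_female)

print("pooled fixed effects: z=%.3f p=%.3g" % (z_fixed, p_fixed))
print("sex-stratified 2-df:  p=%.3g" % p_2df)
```

Pooling dilutes the male-specific signal with the null female estimate, whereas the stratified test keeps the full male signal at the cost of one extra degree of freedom. When the effects are in fact equal in both sexes, the ordering reverses, which is why the choice of method depends on how common sex-specific effects are.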

245 REFERENCES

[ADL08] David Altshuler, Mark J. Daly, and Eric S. Lander. “Genetic mapping in human disease.” Science, 322(5903):881–8, 11 2008.

[AVF11] David L. Aylor, William Valdar, Wendy Foulds-Mathes, Ryan J. Buus, Ricardo A. Verdugo, Ralph S. Baric, Martin T. Ferris, Jeff A. Frelinger, Mark Heise, Matt B. Frieman, Lisa E. Gralinski, Timothy A. Bell, John D. Didion, Kunjie Hua, Derrick L. Nehrenberg, Christine L. Powell, Jill Steigerwalt, Yuying Xie, Samir N. P. Kelada, Francis S. Collins, Ivana V. Yang, David A. Schwartz, Lisa A. Branstetter, Elissa J. Chesler, Darla R. Miller, Jason Spence, Eric Yi Liu, Leonard McMillan, Abhishek Sarkar, Jeremy Wang, Wei Wang, Qi Zhang, Karl W. Broman, Ron Korstanje, Caroline Durrant, Richard Mott, Fuad A. Iraqi, Daniel Pomp, David Threadgill, Fernando Pardo-Manuel de Villena, and Gary A. Churchill. “Genetic analysis of complex traits in the emerging Collaborative Cross.” Genome Res, 21(8):1213–22, 8 2011.

[AWG13] Verneri Anttila, Bendik S. Winsvold, Padhraig Gormley, Tobias Kurth, Francesco Bettella, George McMahon, Mikko Kallela, Rainer Malik, Boukje de Vries, Gisela Terwindt, Sarah E. Medland, Unda Todt, Wendy L. McArdle, Lydia Quaye, Markku Koiranen, M. Arfan Ikram, Terho Lehtimäki, Anine H. Stam, Lannie Ligthart, Juho Wedenoja, Ian Dunham, Benjamin M. Neale, Priit Palta, Eija Hamalainen, Markus Schürks, Lynda M. Rose, Julie E. Buring, Paul M. Ridker, Stacy Steinberg, Hreinn Stefansson, Finnbogi Jakobsson, Debbie A. Lawlor, David M. Evans, Susan M. Ring, Markus Färkkilä, Ville Artto, Mari A. Kaunisto, Tobias Freilinger, Jean Schoenen, Rune R. Frants, Nadine Pelzer, Claudia M. Weller, Ronald Zielman, Andrew C. Heath, Pamela A. F. Madden, Grant W. Montgomery, Nicholas G. Martin, Guntram Borck, Hartmut Göbel, Axel Heinze, Katja Heinze-Kuhn, Frances M. K. Williams, Anna-Liisa L. Hartikainen, Anneli Pouta, Joyce van den Ende, Andre G. Uitterlinden, Albert Hofman, Najaf Amin, Jouke-Jan J. Hot- tenga, Jacqueline M. Vink, Kauko Heikkilä, Michael Alexander, Bertram Muller-Myhsok, Stefan Schreiber, Thomas Meitinger, Heinz Erich Wich- mann, Arpo Aromaa, Johan G. Eriksson, Bryan J. Traynor, Daniah Tra- bzuni, Elizabeth Rossin, Kasper Lage, Suzanne B. R. Jacobs, J. Raphael Gibbs, , Jaakko Kaprio, Brenda W. Penninx, Dorret I. Boomsma, Cornelia van Duijn, Olli Raitakari, Marjo-Riitta R. Jarvelin, John-Anker A. Zwart, Lynn Cherkas, David P. Strachan, Christian Ku- bisch, Michel D. Ferrari, Arn M. J. M. van den Maagdenberg, Mar-

246 tin Dichgans, Maija Wessman, George Davey Smith, Kari Stefansson, Mark J. Daly, Dale R. Nyholt, Daniel I. Chasman, and Aarno Palotie. “Genome-wide meta-analysis identifies new susceptibility loci for mi- graine.” Nat Genet, 6 2013.

[BB09] Brian L. Browning and Sharon R. Browning. “A unified approach to genotype imputation and haplotype-phase inference for large data sets of trios and unrelated individuals.” Am J Hum Genet, 84(2):210–23, 2 2009.

[BB10] Sharon R. Browning and Brian L. Browning. “High-resolution detec- tion of identity by descent in unrelated individuals.” Am J Hum Genet, 86(4):526–39, 4 2010.

[BB11] Brian L. Browning and Sharon R. Browning. “A fast, powerful method for detecting identity by descent.” Am J Hum Genet, 88(2):173–82, 2 2011.

[BBK11] J.A. Blake, C.J. Bult, J.A. Kadin, J.E. Richardson, and J.T. Eppig. “The Mouse Genome Database (MGD): premier model organism resource for mammalian genomics and genetics.” Nucleic acids research, 39(suppl 1):D842, 2011.

[BCG89] J. M. Bourre, M. Clément, D. Gérard, and J. Chaudiére. “Al- terations of cholesterol synthesis precursors (7-dehydrocholesterol, 7- dehydrodesmosterol, desmosterol) in dysmyelinating neurological mu- tant mouse (quaking, shiverer and trembler) in the PNS and the CNS.” Biochim Biophys Acta, 1004(3):387–90, 8 1989.

[BFJ08] P.I.W. de Bakker, M.A.R. Ferreira, X. Jia, B.M. Neale, S. Raychaud- huri, and B.F. Voight. “Practical aspects of imputation-driven meta- analysis of genome-wide association studies.” Human molecular genet- ics, 17(R2):R122, 2008.

[BFO10a] Brian J. Bennett, Charles R. Farber, Luz Orozco, Hyun Min Kang, Ana- tole Ghazalpour, Nathan Siemers, Michael Neubauer, Isaac Neuhaus, Roumyana Yordanova, Bo Guan, Amy Truong, Wen-pin P. Yang, Aiqing He, Paul Kayne, Peter Gargalovic, Todd Kirchgessner, Calvin Pan, Lawrence W. Castellani, Emrah Kostem, Nicholas Furlotte, Thomas A. Drake, Eleazar Eskin, and Aldons J. Lusis. “A high-resolution associa- tion mapping panel for the dissection of complex traits in mice.” Genome Res, 20(2):281–90, 2 2010.

[BFO10b] Brian J. Bennett, Charles R. Farber, Luz Orozco, Hyun Min Kang, Ana- tole Ghazalpour, Nathan Siemers, Michael Neubauer, Isaac Neuhaus,

247 Roumyana Yordanova, Bo Guan, Amy Truong, Wen-pin P. Yang, Aiqing He, Paul Kayne, Peter Gargalovic, Todd Kirchgessner, Calvin Pan, Lawrence W. Castellani, Emrah Kostem, Nicholas Furlotte, Thomas A. Drake, Eleazar Eskin, and Aldons J. Lusis. “A high-resolution associa- tion mapping panel for the dissection of complex traits in mice.” Genome Res, 20(2):281–90, 2 2010.

[BH05] Nan Bing and Ina Hoeschele. “Genetical Genomics Analysis of a Yeast Segregant Population for Transcription Network Inference.” Genetics, 170:533–542, June 2005.

[BJC12] Vesna Boraska, Ana Jeronciˇ c,´ Vincenza Colonna, Lorraine Southam, Dale R. Nyholt, Nigel William Rayner, John R. B. Perry, Daniela To- niolo, Eva Albrecht, Wei Ang, Stefania Bandinelli, Maja Barbalic, Inês Barroso, Jacques S. Beckmann, Reiner Biffar, Dorret Boomsma, Harry Campbell, Tanguy Corre, Jeanette Erdmann, Tõnu Esko, Krista Fischer, Nora Franceschini, Timothy M. Frayling, Giorgia Girotto, Juan R. Gon- zalez, Tamara B. Harris, Andrew C. Heath, Iris M. Heid, Wolfgang Hoff- mann, Albert Hofman, Momoko Horikoshi, Jing Hua Zhao, Anne U. Jackson, Jouke-Jan J. Hottenga, Antti Jula, Mika Kähönen, Kay-Tee T. Khaw, Lambertus A. Kiemeney, Norman Klopp, Zoltán Kutalik, Vasi- liki Lagou, Lenore J. Launer, Terho Lehtimäki, Mathieu Lemire, Marja- Liisa L. Lokki, Christina Loley, Jian’an Luan, Massimo Mangino, Irene Mateo Leach, Sarah E. Medland, Evelin Mihailov, Grant W. Mont- gomery, Gerjan Navis, John Newnham, Markku S. Nieminen, Aarno Palotie, Kalliope Panoutsopoulou, Annette Peters, Nicola Pirastu, Ozren Polasek, Karola Rehnström, Samuli Ripatti, Graham R. S. Ritchie, Fer- nando Rivadeneira, Antonietta Robino, Nilesh J. Samani, So-Youn Y. Shin, Juha Sinisalo, Johannes H. Smit, Nicole Soranzo, Lisette Stolk, Dorine W. Swinkels, Toshiko Tanaka, Alexander Teumer, Anke Tön- jes, Michela Traglia, Jaakko Tuomilehto, Armand Valsesia, Wiek H. van Gilst, Joyce B. J. van Meurs, Albert Vernon Smith, Jorma Viikari, Jacqueline M. Vink, Gerard Waeber, Nicole M. Warrington, Elisabeth Widen, Gonneke Willemsen, Alan F. Wright, Brent W. Zanke, Lina Zgaga, Michael Boehnke, Adamo Pio d’Adamo, Eco de Geus, Ellen W. Demerath, Martin den Heijer, Johan G. Eriksson, Luigi Ferrucci, Chris- tian Gieger, Vilmundur Gudnason, Caroline Hayward, Christian Heng- stenberg, Thomas J. Hudson, Marjo-Riitta R. Järvelin, Manolis Kogev- inas, Ruth J. F. Loos, Nicholas G. Martin, Andres Metspalu, Craig E. Pen- nell, Brenda W. 
Penninx, Markus Perola, Olli Raitakari, Veikko Salomaa, Stefan Schreiber, Heribert Schunkert, Tim D. Spector, Michael Stumvoll, André G. Uitterlinden, Sheila Ulivi, Pim van der Harst, Peter Vollenwei-

248 der, Henry Völzke, Nicholas J. Wareham, H-Erich E. Wichmann, James F. Wilson, Igor Rudan, Yali Xue, and Eleftheria Zeggini. “Genome-wide meta-analysis of common variant differences between men and women.” Hum Mol Genet, 21(21):4805–15, 11 2012.

[BK05] Rachel Brem and Leonid Kruglyak. “The landscape of genetic complex- ity across 5,700 gene expression traits in yeast.” Proceedings of the Na- tional Academy of Sciences, 102:1572–1577, February 2005.

[BSJ13] Sonja I. Berndt, Christine F. Skibola, Vijai Joseph, Nicola J. Camp, Alexandra Nieters, Zhaoming Wang, Wendy Cozen, Alain Monnereau, Sophia S. Wang, Rachel S. Kelly, Qing Lan, Lauren R. Teras, Ni- lanjan Chatterjee, Charles C. Chung, Meredith Yeager, Angela R. Brooks-Wilson, Patricia Hartge, Mark P. Purdue, Brenda M. Birmann, Bruce K. Armstrong, Pierluigi Cocco, Yawei Zhang, Gianluca Severi, Anne Zeleniuch-Jacquotte, Charles Lawrence, Laurie Burdette, Jeffrey Yuenger, Amy Hutchinson, Kevin B. Jacobs, Timothy G. Call, Tait D. Shanafelt, Anne J. Novak, Neil E. Kay, Mark Liebow, Alice H. Wang, Karin E. Smedby, Hans-Olov O. Adami, Mads Melbye, Bengt Glimelius, Ellen T. Chang, Martha Glenn, Karen Curtin, Lisa A. Cannon-Albright, Brandt Jones, W. Ryan Diver, Brian K. Link, George J. Weiner, Lucia Conde, Paige M. Bracci, Jacques Riby, Elizabeth A. Holly, Martyn T. Smith, Rebecca D. Jackson, Lesley F. Tinker, Yolanda Benavente, Niko- laus Becker, Paolo Boffetta, Paul Brennan, Lenka Foretova, Marc May- nadie, James McKay, Anthony Staines, Kari G. Rabe, Sara J. Achenbach, Celine M. Vachon, Lynn R. Goldin, Sara S. Strom, Mark C. Lanasa, Lo- gan G. Spector, Jose F. Leis, Julie M. Cunningham, J. Brice Weinberg, Vicki A. Morrison, Neil E. Caporaso, Aaron D. Norman, Martha S. Linet, Anneclaire J. De Roos, Lindsay M. Morton, Richard K. Severson, Elio Riboli, Paolo Vineis, Rudolph Kaaks, Dimitrios Trichopoulos, Giovanna Masala, Elisabete Weiderpass, María-Dolores D. Chirlaque, Roel C. H. Vermeulen, Ruth C. Travis, Graham G. Giles, Demetrius Albanes, Jarmo Virtamo, Stephanie Weinstein, Jacqueline Clavel, Tongzhang Zheng, Theodore R. Holford, Kenneth Offit, Andrew Zelenetz, Robert J. Klein, John J. Spinelli, Kimberly A. Bertrand, Francine Laden, Edward Gio- vannucci, Peter Kraft, Anne Kricker, Jenny Turner, Claire M. Vajdic, Maria Grazia Ennas, Giovanni M. Ferri, Lucia Miligi, Liming Liang, Joshua Sampson, Simon Crouch, Ju-Hyun H. Park, Kari E. North, Angela Cox, John A. 
Snowden, Josh Wright, Angel Carracedo, Carlos Lopez- Otin, Silvia Bea, Itziar Salaverria, David Martin-Garcia, Elias Campo, Joseph F. Fraumeni, Silvia de Sanjose, Henrik Hjalgrim, James R. Cerhan, Stephen J. Chanock, Nathaniel Rothman, and Susan L. Slager. “Genome-

249 wide association study identifies multiple risk loci for chronic lympho- cytic leukemia.” Nat Genet, 6 2013. [BT12] Sharon R. Browning and Elizabeth A. Thompson. “Detecting Rare Vari- ant Associations by Identity-by-Descent Mapping in Case-Control Stud- ies.” Genetics, 190(4):1521–31, 4 2012. [BYC02] Rachel B. Brem, Gael Yvert, Rebecca Clinton, and Leonid Kruglyak. “Genetic Dissection of Transcriptional Regulation in Budding Yeast.” Science, 296:752–755, April 2002. [BYP05] Paul I. W. de Bakker, Roman Yelensky, Itsik Pe’er, Stacey B. Gabriel, Mark J. Daly, and David Altshuler. “Efficiency and power in genetic association studies.” Nat Genet, 37(11):1217–23, 11 2005. [CAA04] Gary A. Churchill, David C. Airey, Hooman Allayee, Joe M. Angel, Alan D. Attie, Jackson Beatty, William D. Beavis, John K. Belknap, Beth Bennett, Wade Berrettini, Andre Bleich, Molly Bogue, Karl W. Broman, Kari J. Buck, Ed Buckler, Margit Burmeister, Elissa J. Chesler, James M. Cheverud, Steven Clapcote, Melloni N. Cook, Roger D. Cox, John C. Crabbe, Wim E. Crusio, Ariel Darvasi, Christian F. Deschepper, R. W. Doerge, Charles R. Farber, Jiri Forejt, Daniel Gaile, Steven J. Garlow, Hartmut Geiger, Howard Gershenfeld, Terry Gordon, Jing Gu, Weikuan Gu, Gerald de Haan, Nancy L. Hayes, Craig Heller, Heinz Himmelbauer, Robert Hitzemann, Kent Hunter, Hui-Chen C. Hsu, Fuad A. Iraqi, Boris Ivandic, Howard J. Jacob, Ritsert C. Jansen, Karl J. Jepsen, Dabney K. Johnson, Thomas E. Johnson, Gerd Kempermann, Christina Kendziorski, Malak Kotb, R. Frank Kooy, Bastien Llamas, Frank Lammert, Jean- Michel M. Lassalle, Pedro R. Lowenstein, Lu Lu, Aldons Lusis, Ken- neth F. Manly, Ralph Marcucio, Doug Matthews, Juan F. Medrano, Darla R. Miller, Guy Mittleman, Beverly A. Mock, Jeffrey S. Mogil, Xavier Montagutelli, Grant Morahan, David G. Morris, Richard Mott, Joseph H. Nadeau, Hiroki Nagase, Richard S. Nowakowski, Bruce F. O’Hara, Alexander V. Osadchuk, Grier P. 
Page, Beverly Paigen, Ken- neth Paigen, Abraham A. Palmer, Huei-Ju J. Pan, Leena Peltonen- Palotie, Jeremy Peirce, Daniel Pomp, Michal Pravenec, Daniel R. Prows, Zhonghua Qi, Roger H. Reeves, John Roder, Glenn D. Rosen, Eric E. Schadt, Leonard C. Schalkwyk, Ze’ev Seltzer, Kazuhiro Shimomura, Siming Shou, Mikko J. Sillanpää, Linda D. Siracusa, Hans-Willem W. Snoeck, Jimmy L. Spearow, Karen Svenson, Lisa M. Tarantino, David Threadgill, Linda A. Toth, William Valdar, Fernando Pardo-Manuel de Villena, Craig Warden, Steve Whatley, Robert W. Williams, Tim Wilt- shire, Nengjun Yi, Dabao Zhang, Min Zhang, Fei Zou, and Complex Trait

250 Consortium. “The Collaborative Cross, a community resource for the ge- netic analysis of complex traits.” Nat Genet, 36(11):1133–7, 11 2004.

[CDL13] Y. C. Chen, G. H. Dong, K. C. Lin, and Y. L. Lee. “Gender difference of childhood overweight and obesity in predicting the risk of incident asthma: a systematic review and meta-analysis.” Obes Rev, 14(3):222– 31, 3 2013.

[CES07] Lin Chen, Frank Emmert-Streib, and John Storey. “Harnessing naturally randomized transcription to infer regulatory relationships among genes.” Genome Biology, 8:R219, 2007.

[CIM08] Elisa Ciraolo, Manuela Iezzi, Romina Marone, Stefano Marengo, Clau- dia Curcio, Carlotta Costa, Ornella Azzolino, Cristiano Gonella, Cristina Rubinetto, Haiyan Wu, Walter Dastrù, Erica L. Martin, Lorenzo Silengo, Fiorella Altruda, Emilia Turco, Letizia Lanzetti, Piero Musiani, Thomas Rückle, Christian Rommel, Jonathan M. Backer, Guido Forni, Matthias P. Wymann, and Emilio Hirsch. “Phosphoinositide 3-kinase p110beta activ- ity: key role in metabolism and mammary gland cancer but not develop- ment.” Sci Signal, 1(36):ra3, 2008.

[CLS11] Nikoletta Charizopoulou, Andrea Lelli, Margit Schraders, Kausik Ray, Michael S. Hildebrand, Arabandi Ramesh, C. R. Srikumari Srisailapathy, Jaap Oostrik, Ronald J. C. Admiraal, Harold R. Neely, Joseph R. Latoche, Richard J. H. Smith, John K. Northup, Hannie Kremer, Jeffrey R. Holt, and Konrad Noben-Trauth. “Gipc3 mutations associated with audiogenic seizures and sensorineural hearing loss in mouse and human.” Nat Com- mun, 2:201, 2011.

[Coc09] William G. Cochran. The Combination of Estimates from Different Ex- periments. 4 2009.

[DL86a] R. DerSimonian and N. Laird. “Meta-analysis in clinical trials.” Control Clin Trials, 7(3):177–88, 9 1986.

[DL86b] R. DerSimonian and N. Laird. “Meta-analysis in clinical trials.” Control Clin Trials, 7(3):177–88, 9 1986.

[DLW90] M. H. Doolittle, R. C. LeBoeuf, C. H. Warden, L. M. Bee, and A. J. Lusis. “A polymorphism affecting apolipoprotein A-II translational effi- ciency determines high density lipoprotein size and composition.” J Biol Chem, 265(27):16380–8, 9 1990.

251 [DN06] Meghan Drayton and Konrad Noben-Trauth. “Mapping quantitative trait loci for hearing loss in Black Swiss mice.” Hear Res, 212(1-2):128–39, 2 2006.

[DNB13] Richard C. Davis, Atila van Nas, Brian Bennett, Luz Orozco, Calvin Pan, Christoph D. Rau, Eleazar Eskin, and Aldons J. Lusis. “Genome-wide association mapping of blood cell traits in mice.” Mamm Genome, 24(3- 4):105–18, 4 2013.

[DNC12] Richard C. Davis, Atila van Nas, Lawrence W. Castellani, Yi Zhao, Zhiqiang Zhou, Pingzi Wen, Suzanne Yu, Hongxiu Qi, Melenie Rosales, Eric E. Schadt, Karl W. Broman, Miklós Péterfy, and Aldons J. Lusis. “Systems genetics of susceptibility to obesity-induced diabetes in mice.” Physiol Genomics, 44(1):1–13, 1 2012.

[DR99] B. Devlin and K. Roeder. “Genomic control for association studies.” Bio- metrics, 55(4):997–1004, 12 1999.

[DRB01] B. Devlin, Kathryn. Roeder, and Silviu-Alin Bacanu. “Unbiased methods for population-based association studies.” Genet Epidemiol, 21(4):273– 84, 12 2001.

[DRW01] B. Devlin, Kathryn. Roeder, and Larry. Wasserman. “Genomic control, a new approach to genetic-based association studies.” Theor Popul Biol, 60(3):155–66, 11 2001.

[DWH13] Xiayun Dai, Chen Wu, Yunfeng He, Lixuan Gui, Li Zhou, Huan Guo, Jing Yuan, Binyao Yang, Jun Li, Qifei Deng, Suli Huang, Lei Guan, Die Hu, Jiang Zhu, Xinwen Min, Mingjian Lang, Dongfeng Li, Han- dong Yang, Frank B. Hu, Dongxin Lin, Tangchun Wu, and Meian He. “A Genome-Wide Association Study for Serum Bilirubin Levels and Gene- Environment Interaction in a Chinese Population.” Genet Epidemiol, 1 2013.

[EBS11] Andrew C. Edmondson, Peter S. Braund, Ioannis M. Stylianou, Amit V. Khera, Christopher P. Nelson, Megan L. Wolfe, Stephanie L. Derohan- nessian, Brendan J. Keating, Liming Qu, Jing He, Martin D. Tobin, Ma- ciej Tomaszewski, Jens Baumert, Norman Klopp, Angela Döring, Barbara Thorand, Mingyao Li, Muredach P. Reilly, Wolfgang Koenig, Nilesh J. Samani, and Daniel J. Rader. “Dense genotyping of candidate gene loci identifies variants associated with high-density lipoprotein cholesterol.” Circ Cardiovasc Genet, 4(2):145–55, 4 2011.

252 [ECW04] D. Estrada-Smith, L.W. Castellani, H. Wong, P.Z. Wen, A. Chui, A.J. Lu- sis, and R.C. Davis. “Dissection of multigenic obesity traits in congenic mouse strains.” Mammalian Genome, 15(1):14–22, 2004.

[EM07] Daniel Eaton and Kevin Murphy. “Bayesian structure learning using dy- namic programming and MCMC.” In Uncertainty in Artificial Intelli- gence (UAI), 2007.

[EMI07] Evangelos Evangelou, Demetrius M. Maraganore, and John P. A. Ioanni- dis. “Meta-analysis in genome-wide association datasets: strategies and application in Parkinson disease.” PLoS One, 2(2):e196, 2007.

[EW08] Byron Ellis and Wing Hung Wong. “Learning Causal Bayesian Network Structures From Experimental Data.” Journal of the American Statistical Association, 103:778–789, 2008.

[FAK03] Takahiro Fujino, Hiroshi Asaba, Man-Jong J. Kang, Yukio Ikeda, Hideyuki Sone, Shinji Takada, Dong-Ho H. Kim, Ryoichi X. Ioka, Masao Ono, Hiroko Tomoyori, Minoru Okubo, Toshio Murase, Akihisa Ka- mataki, Joji Yamamoto, Kenta Magoori, Sadao Takahashi, Yoshiharu Miyamoto, Hisashi Oishi, Masato Nose, Mitsuyo Okazaki, Shinichi Usui, Katsumi Imaizumi, Masashi Yanagisawa, Juro Sakai, and Tokuo T. Ya- mamoto. “Low-density lipoprotein receptor-related protein 5 (LRP5) is essential for normal cholesterol metabolism and glucose-induced insulin secretion.” Proc Natl Acad Sci U S A, 100(1):229–34, 1 2003.

[FBO11] Charles R. Farber, Brian J. Bennett, Luz Orozco, Wei Zou, Ana Lira, Em- rah Kostem, Hyun Min Kang, Nicholas Furlotte, Ani Berberyan, Anatole Ghazalpour, Jaijam Suwanwela, Thomas A. Drake, Eleazar Eskin, Q. Tian Wang, Steven L. Teitelbaum, and Aldons J. Lusis. “Mouse genome-wide association and systems genetics identify asxl2 as a regulator of bone min- eral density and osteoclastogenesis.” PLoS Genet, 7(4):e1002038, 4 2011.

[FE12] Jonathan Flint and Eleazar Eskin. “Genome-wide association studies in mice.” Nature Reviews Genetics, 13(11):807, 10 2012.

[FET12] Jennifer K. Forsyth, Lauren M. Ellman, Antti Tanskanen, Ulla Mustonen, Matti O. Huttunen, Jaana Suvisaari, and Tyrone D. Cannon. “Genetic Risk for Schizophrenia, Obstetric Complications, and Adolescent School Outcome: Evidence for Gene-Environment Interaction.” Schizophr Bull, 9 2012.

[FK03] and . “Being Bayesian about Network Struc- ture.” In Machine Learning, pp. 95–125, 2003.

253 [FKV12] Nicholas A. Furlotte, Eun Yong Kang, Atila Van Nas, Charles R. Far- ber, Aldons J. Lusis, and Eleazar Eskin. “Increasing association mapping power and resolution in mouse genetic studies through the use of meta- analysis for structured populations.” Genetics, 191(3):959–67, 7 2012.

[FLN00] N. Friedman, M. Linial, I. Nachman, and D. Pe’er. “Using Bayesian Net- works to Analyze Expression Data.” Journal of Computational Biology, 7(3-4):601–620, 2000.

[FLW12] Caroline S. Fox, Yongmei Liu, Charles C. White, Mary Feitosa, Albert V. Smith, Nancy Heard-Costa, Kurt Lohman, Andrew D. Johnson, Mered- ith C. Foster, Danielle M. Greenawalt, Paula Griffin, Jinghong Ding, Anne B. Newman, Fran Tylavsky, Iva Miljkovic, Stephen B. Kritchevsky, Lenore Launer, Melissa Garcia, Gudny Eiriksdottir, J. Jeffrey Carr, Vil- munder Gudnason, Tamara B. Harris, L. Adrienne Cupples, and Ingrid B. Borecki. “Genome-wide association for abdominal subcutaneous and vis- ceral adipose reveals a novel locus for visceral fat in women.” PLoS Genet, 8(5):e1002695, 2012.

[FM01] J. Flint and R. Mott. “Finding the molecular basis of quantitative traits: successes and pitfalls.” Nat Rev Genet, 2(6):437–45, 6 2001.

[FNG09] Charles R. Farber, Atila van Nas, Anatole Ghazalpour, Jason E. Aten, Sudheer Doss, Brandon Sos, Eric E. Schadt, Leslie Ingram-Drake, Richard C. Davis, Steve Horvath, Desmond J. Smith, Thomas A. Drake, and Aldons J. Lusis. “An integrative genetics approach to identify can- didate genes regulating BMD: combining linkage, gene expression, and association.” J Bone Miner Res, 24(1):105–16, 1 2009.

[FPC96] C. Y. Fan, J. Pan, R. Chu, D. Lee, K. D. Kluckman, N. Usuda, I. Singh, A. V. Yeldandi, M. S. Rao, N. Maeda, and J. K. Reddy. “Hepatocellular and hepatic peroxisomal alterations in mice with a disrupted peroxisomal fatty acyl-coenzyme A oxidase gene.” J Biol Chem, 271(40):24698–710, 10 1996.

[FRH13] Ombretta Foresti, Annamaria Ruggiano, Hans K. Hannibal-Bach, Chris- ter S. Ejsing, and Pedro Carvalho. “Sterol homeostasis requires reg- ulated degradation of squalene monooxygenase by the ubiquitin ligase Doa10/Teb4.” eLife, 2, 2013.

[Fri04] N. Friedman. “Inferring Cellular Networks Using Probabilistic Graphical Models.” Science’s STKE, 303(5659):799, 2004.

254 [FSY07] Robert V. Farese, Mini P. Sajan, Hong Yang, Pengfei Li, Steven Mas- torides, William R. Gower, Sonali Nimal, Cheol Soo Choi, Sheene Kim, Gerald I. Shulman, C. Ronald Kahn, Ursula Braun, and Michael Leitges. “Muscle-specific knockout of PKC-lambda impairs glucose transport and induces metabolic and diabetic syndromes.” J Clin Invest, 117(8):2289– 301, 8 2007. [FVH09] Rick A. Friedman, Lut Van Laer, Matthew J. Huentelman, Sonal S. Sheth, Els Van Eyken, Jason J. Corneveaux, Waibhav D. Tembe, Rebecca F. Halperin, Ashley Q. Thorburn, Sofie Thys, Sarah Bonneux, Erik Fransen, Jeroen Huyghe, Ilmari Pyykkö, Cor W. R. J. Cremers, Hannie Kremer, Ingeborg Dhooge, Dafydd Stephens, Eva Orzan, Markus Pfister, Michael Bille, Agnete Parving, Martti Sorri, Paul H. Van de Heyning, Linna Mak- mura, Jeffrey D. Ohmen, Frederick H. Linthicum, Jose N. Fayad, John V. Pearson, David W. Craig, Dietrich A. Stephan, and Guy Van Camp. “GRM7 variants confer susceptibility to age-related hearing impairment.” Hum Mol Genet, 18(4):785–96, 2 2009. [GDZ06] Anatole Ghazalpour, Sudheer Doss, Bin Zhang, Susanna Wang, Christo- pher Plaisier, Ruth Castellanos, Alec Brozell, Eric Schadt, Thomas Drake, Aldons Lusis, and Steve Horvath. “Integrating Genetic and Network Analysis to Characterize Genes Related to Mouse Weight.” PLoS Genet, 2, 2006. [GH94] Dan Geiger and David Heckerman. “Learning Gaussian Networks.” Technical Report MSR-TR-94-10, Redmond, WA, 1994. [GKL11] Alexander Gusev, Eimear E. Kenny, Jennifer K. Lowe, Jaqueline Salit, Richa Saxena, Sekar Kathiresan, David M. Altshuler, Jeffrey M. Fried- man, Jan L. Breslow, and Itsik Pe’er. “DASH: a method for identical-by- descent haplotype mapping uncovers association with recent variation.” Am J Hum Genet, 88(6):706–17, 6 2011. [GLR10] Justin Gerke, Kim Lorenz, Shelina Ramnarine, and Barak Cohen. “Gene- environment interactions at nucleotide resolution.” PLoS Genet, 6(9), 9 2010. [GLS09] Alexander Gusev, Jennifer K. 
Lowe, Markus Stoffel, Mark J. Daly, David Altshuler, Jan L. Breslow, Jeffrey M. Friedman, and Itsik Pe’er. “Whole population, genome-wide mapping of hidden relatedness.” Genome Res, 19(2):318–26, 2 2009. [GM05] George A. Gates and John H. Mills. “Presbycusis.” Lancet, 366(9491):1111–20, 2005.

255 [GNS12] Jianjun Gao, Michael A. Nalls, Min Shi, Bonnie R. Joubert, Dena G. Hernandez, Xuemei Huang, Albert Hollenbeck, Andrew B. Singleton, and Honglei Chen. “An exploratory analysis on gene-environment inter- actions for Parkinson disease.” Neurobiol Aging, 33(10):2528.e1–6, 10 2012.

[GPJ97] S. Gupta, A. M. Pablo, X. c. Jiang, N. Wang, A. R. Tall, and C. Schindler. “IFN-gamma potentiates atherosclerosis in ApoE knock-out mice.” J Clin Invest, 99(11):2752–61, 6 1997.

[GRF12] Anatole Ghazalpour, Christoph D. Rau, Charles R. Farber, Brian J. Bennett, Luz D. Orozco, Atila van Nas, Calvin Pan, Hooman Allayee, Simon W. Beaven, Mete Civelek, Richard C. Davis, Thomas A. Drake, Rick A. Friedman, Nick Furlotte, Simon T. Hui, J. David Jentsch, Emrah Kostem, Hyun Min Kang, Eun Yong Kang, Jong Wha Joo, Vyacheslav A. Korshunov, Rick E. Laughlin, Lisa J. Martin, Jeffrey D. Ohmen, Brian W. Parks, Matteo Pellegrini, Karen Reue, Desmond J. Smith, Sotirios Tetradis, Jessica Wang, Yibin Wang, James N. Weiss, Todd Kirchgessner, Peter S. Gargalovic, Eleazar Eskin, Aldons J. Lusis, and Renée C. LeBoeuf. “Hybrid mouse diversity panel: a panel of inbred mouse strains suitable for analysis of complex genetic traits.” Mamm Genome, 23(9-10):680–92, 10 2012.

[HBM12] Congcong He, Michael C. Bassik, Viviana Moresi, Kai Sun, Yongjie Wei, Zhongju Zou, Zhenyi An, Joy Loh, Jill Fisher, Qihua Sun, Stanley Korsmeyer, Milton Packer, Herman I. May, Joseph A. Hill, Herbert W. Virgin, Christopher Gilpin, Guanghua Xiao, Rhonda Bassel-Duby, Philipp E. Scherer, and Beth Levine. “Exercise-induced BCL2-regulated autophagy is required for muscle glucose homeostasis.” Nature, 481(7382):511–5, 1 2012.

[HDK00] R. Hitzemann, K. Demarest, J. Koyner, L. Cipp, N. Patel, E. Rasmussen, and J. McCaughran. “Effect of genetic cross on the detection of quantitative trait loci and a novel approach to mapping QTLs.” Pharmacol Biochem Behav, 67(4):767–72, 12 2000.

[HE11a] B. Han and E. Eskin. “Random-Effects Model Aimed at Discovering Associations in Meta-Analysis of Genome-wide Association Studies.” The American Journal of Human Genetics, 88(5):586–598, 2011.

[HE11b] Buhm Han and Eleazar Eskin. “Random-effects model aimed at discovering associations in meta-analysis of genome-wide association studies.” Am J Hum Genet, 88(5):586–98, 5 2011.

[HE12] Buhm Han and Eleazar Eskin. “Interpreting meta-analyses of genome-wide association studies.” PLoS Genet, 8(3):e1002555, 2012.

[HGJ01] A.J. Hartemink, D.K. Gifford, T.S. Jaakkola, and R.A. Young. “Using graphical models and genomic expression data to statistically validate models of genetic regulatory networks.” In Pac. Symp. Biocomput, volume 6, pp. 422–433, 2001.

[HKE09a] Buhm Han, Hyun Min Kang, and Eleazar Eskin. “Rapid and accurate multiple testing correction and power estimation for millions of correlated markers.” PLoS Genet, 5(4):e1000456, 4 2009.

[HKE09b] Buhm Han, Hyun Min Kang, and Eleazar Eskin. “Rapid and accurate multiple testing correction and power estimation for millions of correlated markers.” PLoS Genet, 5(4):e1000456, 4 2009.

[HMC02] R. Hitzemann, B. Malmanger, S. Cooper, S. Coulombe, C. Reed, K. Demarest, J. Koyner, L. Cipp, J. Flint, C. Talbot, et al. “Multiple Cross Mapping (MCM) markedly improves the localization of a QTL for ethanol-induced activation.” Genes, Brain and Behavior, 1(4):214–222, 2002.

[HMK13] David A. Hinds, George McMahon, Amy K. Kiefer, Chuong B. Do, Nicholas Eriksson, David M. Evans, Beate St Pourcain, Susan M. Ring, Joanna L. Mountain, Uta Francke, George Davey-Smith, Nicholas J. Timpson, and Joyce Y. Tung. “A genome-wide association meta-analysis of self-reported allergy identifies shared and allergy-specific susceptibility loci.” Nat Genet, 6 2013.

[HSV09] Guo-Jen J. Huang, Sagiv Shifman, William Valdar, Martina Johannesson, Binnaz Yalcin, Martin S. Taylor, Jennifer M. Taylor, Richard Mott, and Jonathan Flint. “High resolution mapping of expression QTLs in heterogeneous stock mice in multiple tissues.” Genome Res, 19(6):1133–40, 6 2009.

[HT96] Rebecca J. Hardy and Simon G. Thompson. “A likelihood approach to meta-analysis with random effects.” Stat Med, 15(6):619–29, 3 1996.

[HT02] Julian P. T. Higgins and Simon G. Thompson. “Quantifying heterogeneity in a meta-analysis.” Stat Med, 21(11):1539–58, 6 2002.

[HT10] Qi Huang and Jianguo Tang. “Age-related hearing loss or presbycusis.” Eur Arch Otorhinolaryngol, 267(8):1179–91, 8 2010.

[HZK10] Jennifer J. Hofmann, Ann C. Zovein, Huilin Koh, Freddy Radtke, Gerry Weinmaster, and M. Luisa Iruela-Arispe. “Jagged1 in the portal vein mesenchyme regulates intrahepatic bile duct development: insights into Alagille syndrome.” Development, 137(23):4061–72, 12 2010.

[Int05] International HapMap Consortium. “A haplotype map of the human genome.” Nature, 437(7063):1299–320, 10 2005.

[IPE07a] John P. A. Ioannidis, Nikolaos A. Patsopoulos, and Evangelos Evangelou. “Heterogeneity in meta-analyses of genome-wide association investigations.” PLoS One, 2(9):e841, 2007.

[IPE07b] John P. A. Ioannidis, Nikolaos A. Patsopoulos, and Evangelos Evangelou. “Uncertainty in heterogeneity estimates in meta-analyses.” BMJ, 335(7626):914–6, 11 2007.

[JAW92] X. C. Jiang, L. B. Agellon, A. Walsh, J. L. Breslow, and A. Tall. “Dietary cholesterol increases transcription of the human cholesteryl ester transfer protein gene in transgenic mice. Dependence on natural flanking sequences.” J Clin Invest, 90(4):1290–5, 10 1992.

[JEC97] K. R. Johnson, L. C. Erway, S. A. Cook, J. F. Willott, and Q. Y. Zheng. “A major gene affecting age-related hearing loss in C57BL/6J mice.” Hear Res, 114(1-2):83–92, 12 1997.

[JGL12] Kenneth R. Johnson, Leona H. Gagnon, Chantal Longo-Guess, and Kelly L. Kane. “Association of a citrate synthase missense mutation with age-related hearing loss in A/J mice.” Neurobiol Aging, 33(8):1720–9, 8 2012.

[JLG08] Kenneth R. Johnson, Chantal Longo-Guess, Leona H. Gagnon, Heping Yu, and Qing Yin Zheng. “A locus on distal chromosome 11 (ahl8) and its interaction with Cdh23 ahl underlie the early onset, age-related hearing loss of DBA/2J mice.” Genomics, 92(4):219–25, 10 2008.

[JW02] M. I. Jordan and Y. Weiss. “Graphical Models: Probabilistic Inference.” In M. Arbib, editor, The Handbook of Brain Theory and Neural Networks, 2nd edition. Cambridge, MA: MIT Press, 2002.

[JZB01] K. R. Johnson, Q. Y. Zheng, Y. Bykhovskaya, O. Spirina, and N. Fischel-Ghodsian. “A nuclear-mitochondrial DNA interaction affecting hearing impairment in mice.” Nat Genet, 27(2):191–4, 2 2001.

[JZE00] K. R. Johnson, Q. Y. Zheng, and L. C. Erway. “A major gene affecting age-related hearing loss is common to at least ten inbred strains of mice.” Genomics, 70(2):171–80, 12 2000.

[JZW05] K. R. Johnson, Q. Y. Zheng, M. D. Weston, L. J. Ptacek, and K. Noben-Trauth. “The Mass1frings mutation underlies early onset hearing impairment in BUB/BnJ mice, a model for the auditory pathology of Usher syndrome IIC.” Genomics, 85(5):582–90, 5 2005.

[KAI07] Insook Kim, Sung-Hoon H. Ahn, Takeshi Inagaki, Mihwa Choi, Shinji Ito, Grace L. Guo, Steven A. Kliewer, and Frank J. Gonzalez. “Differential regulation of bile acid homeostasis by the farnesoid X receptor in liver and intestine.” J Lipid Res, 48(12):2664–72, 12 2007.

[KCD12] William J. Kostis, Jerry Q. Cheng, Jeanne M. Dobrzynski, Javier Cabrera, and John B. Kostis. “Meta-analysis of statin effects in women versus men.” J Am Coll Cardiol, 59(6):572–82, 2 2012.

[KCS13] Ai Kubo, Michael Blaise Cook, Nicholas J. Shaheen, Thomas L. Vaughan, David C. Whiteman, Liam Murray, and Douglas A. Corley. “Sex-specific associations between body mass index, waist circumference and the risk of Barrett’s oesophagus: a pooled analysis from the international BEACON consortium.” Gut, 1 2013.

[KGT08] Yukio Kawahara, Adda Grimberg, Sarah Teegarden, Cedric Mombereau, Sui Liu, Tracy L. Bale, Julie A. Blendy, and Kazuko Nishikura. “Dysregulated editing of serotonin 2C receptor mRNAs results in energy dissipation and loss of fat mass.” J Neurosci, 28(48):12834–44, 11 2008.

[KJ06] David Kulp and Manjunatha Jagalur. “Causal inference of regulator-target pairs by gene mapping of expression phenotypes.” BMC Genomics, 7:125, 2006.

[KK89] T. Kamada and S. Kawai. “An algorithm for drawing general undirected graphs.” Inf. Process. Lett., 31(1):7–15, 1989.

[KKW10a] Andrew Kirby, Hyun Min Kang, Claire M. Wade, Chris Cotsapas, Emrah Kostem, Buhm Han, Nick Furlotte, Eun Yong Kang, Manuel Rivas, Molly A. Bogue, Kelly A. Frazer, Frank M. Johnson, Erica J. Beilharz, David R. Cox, Eleazar Eskin, and Mark J. Daly. “Fine mapping in 94 inbred mouse strains using a high-density haplotype resource.” Genetics, 185(3):1081–95, 7 2010.

[KKW10b] Andrew Kirby, Hyun Min Kang, Claire M. Wade, Chris J. Cotsapas, Emrah Kostem, Buhm Han, Nick Furlotte, Eun Yong Kang, Manuel Rivas, Molly A. Bogue, Kelly A. Frazer, Frank M. Johnson, Erica J. Beilharz, David R. Cox, Eleazar Eskin, and Mark J. Daly. “Fine Mapping in 94 Inbred Mouse Strains Using a High-density Haplotype Resource.” Genetics, 5 2010.

[KN12] James M. Keller and Konrad Noben-Trauth. “Genome-wide linkage analyses identify Hfhl1 and Hfhl3 with frequency-specific effects on the hearing spectrum of NIH Swiss mice.” BMC Genet, 13:32, 2012.

[KNL11] James M. Keller, Harold R. Neely, Joseph R. Latoche, and Konrad Noben-Trauth. “High-frequency sensorineural hearing loss and its underlying genetics (Hfhl1 and Hfhl2) in NIH Swiss mice.” J Assoc Res Otolaryngol, 12(5):617–31, 10 2011.

[Koi06] Mikko Koivisto. “Advances in exact Bayesian structure discovery in Bayesian networks.” In Uncertainty in Artificial Intelligence (UAI), 2006.

[KS04] Mikko Koivisto and Kismat Sood. “Exact Bayesian structure discovery in Bayesian networks.” Journal of Machine Learning Research, pp. 549–573, 2004.

[KSS10] Hyun Min Kang, Jae Hoon Sul, Susan K. Service, Noah A. Zaitlen, Sit-Yee Y. Kong, Nelson B. Freimer, Chiara Sabatti, and Eleazar Eskin. “Variance component model to account for sample structure in genome-wide association studies.” Nat Genet, 42(4):348–54, 4 2010.

[KSY09] Eun Yong Kang, Ilya Shpitser, Chun Ye, and Eleazar Eskin. “Detecting the Presence and Absence of Causal Relationships Between Expression of Yeast Genes with Very Few Samples.” In Thirteenth Annual Conference on Research in Computational Molecular Biology, pp. 466–481, 2009.

[KZW08] Hyun Min Kang, Noah A. Zaitlen, Claire M. Wade, Andrew Kirby, David Heckerman, Mark J. Daly, and Eleazar Eskin. “Efficient control of population structure in model organism association mapping.” Genetics, 178(3):1709, 2008.

[Lan02] Kenneth Lange. Mathematical and statistical methods for genetic analysis. Springer Verlag, 2002.

[LB94] Wai Lam and Fahiem Bacchus. “Learning Bayesian Belief Networks: An Approach Based on the MDL Principle.” Computational Intelligence, 10(4), 1994.

[LCD11] R. Liegel, B. Chang, R. Dubielzig, and D. J. Sidjanin. “Blind sterile 2 (bs2), a hypomorphic mutation in Agps, results in cataracts and male sterility in mice.” Mol Genet Metab, 103(1):51–9, 5 2011.

[LEP09] Lei O. Li, Jessica M. Ellis, Heather A. Paich, Shuli Wang, Nan Gong, George Altshuller, Randy J. Thresher, Timothy R. Koves, Steven M. Watkins, Deborah M. Muoio, Gary W. Cline, Gerald I. Shulman, and Rosalind A. Coleman. “Liver-specific loss of long chain acyl-CoA synthetase-1 decreases triacylglycerol synthesis and beta-oxidation and alters phospholipid fatty acid composition.” J Biol Chem, 284(41):27816–26, 10 2009.

[LGP12] Rebecca L. Leshan, Megan Greenwald-Yarnell, Christa M. Patterson, Ian E. Gonzalez, and Martin G. Myers. “Leptin action through hypothalamic nitric oxide synthase-1-expressing neurons controls energy balance.” Nat Med, 18(5):820–3, 5 2012.

[LLC07] Shih-Ping P. Liu, Ying-Shiuan S. Li, Yann-Jang J. Chen, En-Pei P. Chiang, Anna Fen-Yau Li, Ying-Hue H. Lee, Ting-Fen F. Tsai, Michael Hsiao, Shiu-Feng F. Huang, Shiu-Feng F. Hwang, and Yi-Ming Arthur M. Chen. “Glycine N-methyltransferase-/- mice develop chronic hepatitis and glycogen storage disease in the liver.” Hepatology, 46(5):1413–25, 11 2007.

[LLK13] Jennifer Listgarten, Christoph Lippert, Eun Yong Kang, Jing Xiang, Carl M Kadie, and David Heckerman. “A powerful and efficient set test for genetic markers that handles confounders.” Bioinformatics, 29(12):1526–1533, 2013.

[LLW05] Renhua Li, Malcolm A. Lyons, Henning Wittenburg, Beverly Paigen, and Gary A. Churchill. “Combining Data From Multiple Inbred Line Crosses Improves the Power and Resolution of Quantitative Trait Loci Mapping.” Genetics, 169(3):1699–1709, 3 2005.

[LNN11] Joseph R. Latoche, Harold R. Neely, and Konrad Noben-Trauth. “Polygenic inheritance of sensorineural hearing loss (Snhl2, -3, and -4) and organ of Corti patterning defect in the ALR/LtJ mouse strain.” Hear Res, 275(1-2):150–9, 5 2011.

[LPD06] Su-In Lee, Dana Pe’er, Aimee M. Dudley, George M. Church, and Daphne Koller. “Identifying regulatory mechanisms using individual variation reveals key role for chromatin modification.” Proceedings of the National Academy of Sciences, 103:14062–14067, September 2006.

[LQK13] Christoph Lippert, Gerald Quon, Eun Yong Kang, Carl M. Kadie, Jennifer Listgarten, and David Heckerman. “The benefits of selecting phenotype-specific variants for applications of mixed models in genomics.” Scientific Reports, 3, 2013.

[LS09] Dan-Yu Y. Lin and Patrick F. Sullivan. “Meta-analysis of genome-wide association studies with overlapping subjects.” Am J Hum Genet, 85(6):862–72, 12 2009.

[LWD00] K. Lindblad-Toh, E. Winchester, M. J. Daly, D. G. Wang, J. N. Hirschhorn, J. P. Laviolette, K. Ardlie, D. E. Reich, E. Robinson, P. Sklar, N. Shah, D. Thomas, J. B. Fan, T. Gingeras, J. Warrington, N. Patil, T. J. Hudson, and E. S. Lander. “Large-scale discovery and genotyping of single-nucleotide polymorphisms in the mouse.” Nat Genet, 24(4):381–6, 4 2000.

[LWG11] Sang Hong Lee, Naomi R. Wray, Michael E. Goddard, and Peter M. Visscher. “Estimating missing heritability for disease from genome-wide association studies.” Am J Hum Genet, 88(3):294–305, 3 2011.

[MBB10] Hugh Morgan, Tim Beck, Andrew Blake, Hilary Gates, Niels Adams, Guillaume Debouzy, Sophie Leblanc, Christoph Lengger, Holger Maier, David Melvin, Hamid Meziane, Dave Richardson, Sara Wells, Jacqui White, Joe Wood, Martin Hrabé de Angelis, Steve D. M. Brown, John M. Hancock, and Ann-Marie M. Mallon. “EuroPhenome: a repository for high-throughput mouse phenotyping data.” Nucleic Acids Res, 38(Database issue):D577–85, 1 2010.

[MCC09] Teri A. Manolio, Francis S. Collins, Nancy J. Cox, David B. Goldstein, Lucia A. Hindorff, David J. Hunter, Mark I. McCarthy, Erin M. Ramos, Lon R. Cardon, Aravinda Chakravarti, Judy H. Cho, Alan E. Guttmacher, Augustine Kong, Leonid Kruglyak, Elaine Mardis, Charles N. Rotimi, Montgomery Slatkin, David Valle, Alice S. Whittemore, Michael Boehnke, Andrew G. Clark, Evan E. Eichler, Greg Gibson, Jonathan L. Haines, Trudy F. C. Mackay, Steven A. McCarroll, and Peter M. Visscher. “Finding the missing heritability of complex diseases.” Nature, 461(7265):747–53, 10 2009.

[MF13] Richard Mott and Jonathan Flint. “Dissecting Quantitative Traits in Mice.” Annu Rev Genomics Hum Genet, 7 2013.

[MGP09] G. Manenti, A. Galvan, A. Pettinicchio, G. Trincucci, E. Spada, A. Zolin, S. Milani, A. Gonzalez-Neira, and T.A. Dragani. “Mouse genome-wide association mapping needs linkage analysis to avoid false-positive loci.” PLoS Genetics, 5(1):e1000331, 2009.

[MHK93] A. M. van den Maagdenberg, M. H. Hofker, P. J. Krimpenfort, I. de Bruijn, B. van Vlijmen, H. van der Boom, L. M. Havekes, and R. R. Frants. “Transgenic mice carrying the apolipoprotein E3-Leiden gene exhibit hyperlipoproteinemia.” J Biol Chem, 268(14):10540–5, 5 1993.

[MHM07] Jonathan Marchini, Bryan Howie, Simon Myers, Gil McVean, and Peter Donnelly. “A new multipoint method for genome-wide association studies by imputation of genotypes.” Nat Genet, 39(7):906–13, 7 2007.

[ML12] Barbara J. Mason and Philippe Lehert. “Acamprosate for alcohol dependence: a sex-specific meta-analysis based on individual patient data.” Alcohol Clin Exp Res, 36(3):497–508, 3 2012.

[MLL11] Alisa K. Manning, Michael LaValley, Ching-Ti T. Liu, Ken Rice, Ping An, Yongmei Liu, Iva Miljkovic, Laura Rasmussen-Torvik, Tamara B. Harris, Michael A. Province, Ingrid B. Borecki, Jose C. Florez, James B. Meigs, L. Adrienne Cupples, and Josée Dupuis. “Meta-analysis of Gene-Environment interaction: joint estimation of SNP and SNP×Environment regression coefficients.” Genetic Epidemiology, 35(1):11, 1 2011.

[MLM10] Reedik Magi, Cecilia M. Lindgren, and Andrew P. Morris. “Meta-analysis of sex-specific genome-wide association studies.” Genet Epidemiol, 34(8):846–53, 12 2010.

[MMF12] Mariana Murea, Lijun Ma, and Barry I. Freedman. “Genetic and environmental factors associated with type 2 diabetes and diabetic vascular complications.” Rev Diabet Stud, 9(1):6–22, 2012.

[MS07] F. Markowetz and R. Spang. “Inferring cellular networks–a review.” BMC Bioinformatics, 8(Suppl 6):S5, 2007.

[MXX12] Jianzhong Ma, Feifei Xiao, Momiao Xiong, Angeline S. Andrew, Hermann Brenner, Eric J. Duell, Aage Haugen, Clive Hoggart, Rayjean J. Hung, Philip Lazarus, Changlu Liu, Keitaro Matsuo, Jose Ignacio Mayordomo, Ann G. Schwartz, Andrea Staratschek-Jox, Erich Wichmann, Ping Yang, and Christopher I. Amos. “Natural and orthogonal interaction framework for modeling gene-environment interactions with application to lung cancer.” Hum Hered, 73(4):185–94, 2012.

[MY95] David Madigan and Jeremy York. “Bayesian Graphical Models for Discrete Data.” In International Statistical Review, pp. 215–232, 1995.

[NGW09] Atila van Nas, Debraj Guhathakurta, Susanna S. Wang, Nadir Yehya, Steve Horvath, Bin Zhang, Leslie Ingram-Drake, Gautam Chaudhuri, Eric E. Schadt, Thomas A. Drake, Arthur P. Arnold, and Aldons J. Lusis. “Elucidating the role of gonadal hormones in sexually dimorphic gene coexpression networks.” Endocrinology, 150(3):1235–49, 3 2009.

[NIS10] Atila van Nas, Leslie Ingram-Drake, Janet S. Sinsheimer, Susanna S. Wang, Eric E. Schadt, Thomas Drake, and Aldons J. Lusis. “Expression quantitative trait loci: replication, tissue- and sex-specificity in mice.” Genetics, 185(3):1059–68, 7 2010.

[NNV94] P. M. Nishina, J. K. Naggert, J. Verstuyft, and B. Paigen. “Atherosclerosis in genetically obese mice: the mutants obese, diabetes, fat, tubby, and lethal yellow.” Metabolism, 43(5):554–8, 5 1994.

[NTS08] Sei Nakata, Masato Tsutsui, Hiroaki Shimokawa, Osamu Suda, Tsuyoshi Morishita, Kiyoko Shibata, Yasuko Yatera, Ken Sabanai, Akihide Tanimoto, Machiko Nagasaki, Hiromi Tasaki, Yasuyuki Sasaguri, Yasuhide Nakashima, Yutaka Otsuji, and Nobuyuki Yanagihara. “Spontaneous myocardial infarction in mice lacking all nitric oxide synthase isoforms.” Circulation, 117(17):2211–23, 4 2008.

[NWW05] Colleen M. Niswender, Brandon S. Willis, Angela Wallen, Ian R. Sweet, Thomas L. Jetton, Brian R. Thompson, Chaodong Wu, Alex J. Lange, and G. Stanley McKnight. “Cre recombinase-dependent expression of a constitutively active mutant allele of the catalytic subunit of protein kinase A.” Genesis, 43(3):109–19, 11 2005.

[NZJ03] Konrad Noben-Trauth, Qing Yin Zheng, and Kenneth R. Johnson. “Association of cadherin 23 with polygenic inheritance and genetic modification of sensorineural hearing loss.” Nat Genet, 35(1):21–3, 9 2003.

[OBF12] Luz D. Orozco, Brian J. Bennett, Charles R. Farber, Anatole Ghazalpour, Calvin Pan, Nam Che, Pingzi Wen, Hong Xiu Qi, Adonisa Mutukulu, Nathan Siemers, Isaac Neuhaus, Roumyana Yordanova, Peter Gargalovic, Matteo Pellegrini, Todd Kirchgessner, and Aldons J. Lusis. “Unraveling inflammatory responses using systems genetics and gene-environment interactions in macrophages.” Cell, 151(3):658–70, 10 2012.

[PAO97] A. S. Plump, N. Azrolan, H. Odaka, L. Wu, X. Jiang, A. Tall, S. Eisenberg, and J. L. Breslow. “ApoA-I knockout mice: characterization of HDL metabolism in homozygotes and identification of a post-RNA mechanism of apoA-I up-regulation in heterozygotes.” J Lipid Res, 38(5):1033–47, 5 1997.

[Pay07] B.A. Payseur et al. “Prospects for association mapping in classical inbred mouse strains.” Genetics, 175(4):1999, 2007.

[PBL07] J.L. Peirce, K.W. Broman, L. Lu, and R.W. Williams. “A simple method for combining genetic mapping data from multiple crosses and experimental designs.” PLoS One, 2(10):e1036, 2007.

[PCK13] Chirag J. Patel, Rong Chen, Keiichi Kodama, John P. A. Ioannidis, and Atul J. Butte. “Systematic identification of interaction effects between genome- and environment-wide associations in type 2 diabetes mellitus.” Hum Genet, 1 2013.

[Pea88] Judea Pearl. Probabilistic Reasoning in Intelligent Systems. Morgan and Kaufmann, San Mateo, 1988.

[Pea00] Judea Pearl. Causality: Models, Reasoning, and Inference. Cambridge University Press, 2000.

[PHW13] Sanne A. E. Peters, Rachel R. Huxley, and Mark Woodward. “Comparison of the Sex-Specific Associations Between Systolic Blood Pressure and the Risk of Cardiovascular Disease: A Systematic Review and Meta-Analysis of 124 Cohort Studies, Including 1.2 Million Individuals.” Stroke, 7 2013.

[PMB04] Mathew T. Pletcher, Philip McClurg, Serge Batalov, Andrew I. Su, S. Whitney Barnes, Erica Lagler, Ron Korstanje, Xiaosong Wang, Deborah Nusskern, Molly A. Bogue, Richard J. Mural, Beverly Paigen, and Tim Wiltshire. “Use of a dense single nucleotide polymorphism map for in silico mapping in the mouse.” PLoS Biol, 2(12):e393, 12 2004.

[PMP13] Eleonora Porcu, Marco Medici, Giorgio Pistis, Claudia B. Volpato, Scott G. Wilson, Anne R. Cappola, Steffan D. Bos, Joris Deelen, Martin den Heijer, Rachel M. Freathy, Jari Lahti, Chunyu Liu, Lorna M. Lopez, Ilja M. Nolte, Jeffrey R. O’Connell, Toshiko Tanaka, Stella Trompet, Alice Arnold, Stefania Bandinelli, Marian Beekman, Stefan Böhringer, Suzanne J. Brown, Brendan M. Buckley, Clara Camaschella, Anton J. M. de Craen, Gail Davies, Marieke C. H. de Visser, Ian Ford, Tom Forsen, Timothy M. Frayling, Laura Fugazzola, Martin Gögele, Andrew T. Hattersley, Ad R. Hermus, Albert Hofman, Jeanine J. Houwing-Duistermaat, Richard A. Jensen, Eero Kajantie, Margreet Kloppenburg, Ee M. Lim, Corrado Masciullo, Stefano Mariotti, Cosetta Minelli, Braxton D. Mitchell, Ramaiah Nagaraja, Romana T. Netea-Maier, Aarno Palotie, Luca Persani, Maria G. Piras, Bruce M. Psaty, Katri Räikkönen, J. Brent Richards, Fernando Rivadeneira, Cinzia Sala, Mona M. Sabra, Naveed Sattar, Beverley M. Shields, Nicole Soranzo, John M. Starr, David J. Stott, Fred C. G. J. Sweep, Gianluca Usala, Melanie M. van der Klauw, Diana van Heemst, Alies van Mullem, Sita H. Vermeulen, W. Edward Visser, John P. Walsh, Rudi G. J. Westendorp, Elisabeth Widen, Guangju Zhai, Francesco Cucca, Ian J. Deary, Johan G. Eriksson, Luigi Ferrucci, Caroline S. Fox, J. Wouter Jukema, Lambertus A. Kiemeney, Peter P. Pramstaller, David Schlessinger, Alan R. Shuldiner, Eline P. Slagboom, André G. Uitterlinden, Bijay Vaidya, Theo J. Visser, Bruce H. R. Wolffenbuttel, Ingrid Meulenbelt, Jerome I. Rotter, Tim D. Spector, Andrew A. Hicks, Daniela Toniolo, Serena Sanna, Robin P. Peeters, and Silvia Naitza. “A meta-analysis of thyroid-related traits reveals novel loci and gender-specific differences in the regulation of thyroid function.” PLoS Genet, 9(2):e1003266, 2013.

[PNO13] Brian W. Parks, Elizabeth Nam, Elin Org, Emrah Kostem, Frode Norheim, Simon T. Hui, Calvin Pan, Mete Civelek, Christoph D. Rau, Brian J. Bennett, Margarete Mehrabian, Luke K. Ursell, Aiqing He, Lawrence W. Castellani, Bradley Zinker, Mark Kirby, Thomas A. Drake, Christian A. Drevon, Rob Knight, Peter Gargalovic, Todd Kirchgessner, Eleazar Eskin, and Aldons J. Lusis. “Genetic control of obesity and gut microbiota composition in response to high-fat, high-sucrose diet in mice.” Cell Metab, 17(1):141–52, 1 2013.

[PNT07] Shaun Purcell, Benjamin Neale, Kathe Todd-Brown, Lori Thomas, Manuel A. R. Ferreira, David Bender, Julian Maller, Pamela Sklar, Paul I. W. de Bakker, Mark J. Daly, and Pak C. Sham. “PLINK: a tool set for whole-genome association and population-based linkage analyses.” Am J Hum Genet, 81(3):559–75, 9 2007.

[PPP06] Alkes L. Price, Nick J. Patterson, Robert M. Plenge, Michael E. Weinblatt, Nancy A. Shadick, and David Reich. “Principal components analysis corrects for stratification in genome-wide association studies.” Nature Genetics, 38(8):904, 7 2006.

[PRE01] D. Pe’er, A. Regev, G. Elidan, and N. Friedman. “Inferring subnetworks from perturbed expression profiles.” 2001.

[PV91] Judea Pearl and T. S. Verma. “A Theory of Inferred Causation.” In Principles of Knowledge Representation and Reasoning: Proceedings of the Second International Conference, pp. 441–452, 1991.

[QLM07] Hui-Qi Q. Qu, Yang Lu, Luc Marchand, François Bacot, Rosalie Fréchette, Marie-Catherine C. Tessier, Alexandre Montpetit, and Constantin Polychronakos. “Genetic control of alternative splicing in the TAP2 gene: possible implication in the genetics of type 1 diabetes.” Diabetes, 56(1):270–5, 1 2007.

[QTM97] J. H. Qiao, J. Tripathi, N. K. Mishra, Y. Cai, S. Tripathi, X. P. Wang, S. Imes, M. C. Fishbein, S. K. Clinton, P. Libby, A. J. Lusis, and T. B. Rajavashisth. “Role of macrophage colony-stimulating factor in atherosclerosis: studies of osteopetrotic mice.” Am J Pathol, 150(5):1687–99, 5 1997.

[RGM02] Mark Robinson, Jorg Grigull, Naveed Mohammad, and Timothy Hughes. “FunSpec: a web-based cluster interpreter for yeast.” BMC Bioinformatics, 3:35, 2002.

[SB09] Matthew Stephens and David J. Balding. “Bayesian statistical methods for genetic association studies.” Nat Rev Genet, 10(10):681–90, 10 2009.

[SBS03] Amanda Sainsbury, Paul A. Baldock, Christoph Schwarzer, Naohiko Ueno, Ronaldo F. Enriquez, Michelle Couzens, Akio Inui, Herbert Herzog, and Edith M. Gardiner. “Synergistic effects of Y2 and Y4 receptors on adiposity and bone mass revealed in double knockout mice.” Mol Cell Biol, 23(15):5225–33, 8 2003.

[SGS93] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. Springer Verlag, New York, 1993.

[SGS00] P. Spirtes, C. Glymour, and R. Scheines. Causation, Prediction, and Search. MIT Press, 2000.

[SK08] Erin N. Smith and Leonid Kruglyak. “Gene-environment interaction in yeast gene expression.” PLoS Biol, 6(4):e83, 4 2008.

[SLG10] Jung-Bum B. Shin, Chantal M. Longo-Guess, Leona H. Gagnon, Katherine W. Saylor, Rachel A. Dumont, Kateri J. Spinelli, James M. Pagana, Phillip A. Wilmarth, Larry L. David, Peter G. Gillespie, and Kenneth R. Johnson. “The R109H variant of fascin-2, a developmentally regulated actin crosslinker in hair-cell stereocilia, underlies early-onset hearing loss of DBA/2J mice.” J Neurosci, 30(29):9683–94, 7 2010.

[SLY05] Eric E Schadt, John Lamb, Xia Yang, Jun Zhu, Steve Edwards, Debraj GuhaThakurta, Solveig K Sieberts, Stephanie Monks, Marc Reitman, Chunsheng Zhang, Pek Yee Lum, Amy Leonardson, Rolf Thieringer, Joseph M Metzger, Liming Yang, John Castle, Haoyuan Zhu, Shera F Kash, Thomas A Drake, Alan Sachs, and Aldons J Lusis. “An integrative genomics approach to infer causal associations between gene expression and disease.” Nat Genet, 37:710–717, July 2005.

[SMD11] Zhan Su, Jonathan Marchini, and Peter Donnelly. “HAPGEN2: simulation of multiple disease SNPs.” Bioinformatics, 27(16):2304–5, 8 2011.

[SRC09] Nicole Soranzo, Fernando Rivadeneira, Usha Chinappen-Horsley, Ida Malkina, J. Brent Richards, Naomi Hammond, Lisette Stolk, Alexandra Nica, Michael Inouye, Albert Hofman, Jonathan Stephens, Eleanor Wheeler, Pascal Arp, Rhian Gwilliam, P. Mila Jhamai, Simon Potter, Amy Chaney, Mohammed J. R. Ghori, Radhi Ravindrarajah, Sergey Ermakov, Karol Estrada, Huibert A. P. Pols, Frances M. Williams, Wendy L. McArdle, Joyce B. van Meurs, Ruth J. F. Loos, Emmanouil T. Dermitzakis, Kourosh R. Ahmadi, Deborah J. Hart, Willem H. Ouwehand, Nicholas J. Wareham, Inês Barroso, Manjinder S. Sandhu, David P. Strachan, Gregory Livshits, Timothy D. Spector, André G. Uitterlinden, and Panos Deloukas. “Meta-analysis of genome-wide scans for human adult stature identifies novel Loci and associations with measures of skeletal frame size.” PLoS Genet, 5(4):e1000445, 4 2009.

[SSR03] E. Segal, M. Shapira, A. Regev, D. Pe’er, D. Botstein, D. Koller, and N. Friedman. “Module networks: identifying regulatory modules and their condition-specific regulators from gene expression data.” Nat Genet, 34(2):166–76, 2003.

[ST03] John Storey and Robert Tibshirani. “Statistical significance for genomewide studies.” Proceedings of the National Academy of Sciences, 100:9440–9445, 2003.

[STM05] Aravind Subramanian, Pablo Tamayo, Vamsi K. Mootha, Sayan Mukherjee, Benjamin L. Ebert, Michael A. Gillette, Amanda Paulovich, Scott L. Pomeroy, Todd R. Golub, Eric S. Lander, and Jill P. Mesirov. “From the Cover: Gene set enrichment analysis: A knowledge-based approach for interpreting genome-wide expression profiles.” Proceedings of the National Academy of Sciences, 102:15545–15550, October 2005.

[Suz93] J. Suzuki. “A construction of Bayesian networks from databases based on an MDL scheme.” In UAI 93, pp. 266–273, 1993.

[SWC10] Kristin I. Stanford, Lianchun Wang, Jan Castagnola, Danyin Song, Joseph R. Bishop, Jillian R. Brown, Roger Lawrence, Xaiomei Bai, Hiroko Habuchi, Masakazu Tanaka, Wellington V. Cardoso, Koji Kimata, and Jeffrey D. Esko. “Heparan sulfate 2-O-sulfotransferase is required for triglyceride-rich lipoprotein clearance.” J Biol Chem, 285(1):286–94, 1 2010.

[Tal07] Philippa J. Talmud. “Gene-environment interaction and its impact on coronary heart disease risk.” Nutr Metab Cardiovasc Dis, 17(2):148–52, 2 2007.

[TBM87] J. A. Todd, J. I. Bell, and H. O. McDevitt. “HLA-DQ beta gene contributes to susceptibility and resistance to insulin-dependent diabetes mellitus.” Nature, 329(6140):599–604, 1987.

[TH09] Jin Tian and Ru He. “Computing Posterior Probabilities of Structural Features in Bayesian Networks.” In Uncertainty in Artificial Intelligence (UAI), 2009.

[THS13] Shunsuke Takasuga, Yasuo Horie, Junko Sasaki, Ge-Hong H. Sun-Wada, Nobuyuki Kawamura, Ryota Iizuka, Katsunori Mizuno, Satoshi Eguchi, Satoshi Kofuji, Hirotaka Kimura, Masakazu Yamazaki, Chihoko Horie, Eri Odanaga, Yoshiko Sato, Shinsuke Chida, Kenji Kontani, Akihiro Harada, Toshiaki Katada, Akira Suzuki, Yoh Wada, Hirohide Ohnishi, and Takehiko Sasaki. “Critical roles of type III phosphatidylinositol phosphate kinase in murine embryonic visceral endoderm and adult intestine.” Proc Natl Acad Sci U S A, 110(5):1726–31, 1 2013.

[VHH10] Lut Van Laer, Jeroen R. Huyghe, Samuli Hannula, Els Van Eyken, Dietrich A. Stephan, Elina Mäki-Torkko, Pekka Aikio, Erik Fransen, Alana Lysholm-Bernacchi, Martti Sorri, Matthew J. Huentelman, and Guy Van Camp. “A genome-wide association study for age-related hearing impairment in the Saami.” Eur J Hum Genet, 18(6):685–93, 6 2010.

[VP90] T. S. Verma and Judea Pearl. “Equivalence and Synthesis of Causal Models.” Technical Report R-150, Department of Computer Science, University of California, Los Angeles, 1990.

[VP05] Benjamin F. Voight and Jonathan K. Pritchard. “Confounding from cryptic relatedness in case-control association studies.” PLoS Genet, 1(3):e32, 9 2005.

[VSG06] William Valdar, Leah C. Solberg, Dominique Gauguier, Stephanie Burnett, Paul Klenerman, William O. Cookson, Martin S. Taylor, J. Nicholas P. Rawlins, Richard Mott, and Jonathan Flint. “Genome-wide genetic association of complex traits in heterogeneous stock mice.” Nat Genet, 38(8):879–87, 8 2006.

[Was04] Larry Wasserman. All of statistics. Springer New York, 2004.

[Wel07] Wellcome Trust Case Control Consortium. “Genome-wide association study of 14,000 cases of seven common diseases and 3,000 shared controls.” Nature, 447(7145):661–78, 6 2007.

[WHQ93] Craig H. Warden, Catherine C. Hedrick, Jian-Hua Qiao, Lawrence W. Castellani, and Aldons J. Lusis. “Atherosclerosis in transgenic mice overexpressing apolipoprotein A-II.” Science, 261(5120):469–72, 7 1993.

[WKZ12] Chen Wu, Peter Kraft, Kan Zhai, Jiang Chang, Zhaoming Wang, Yun Li, Zhibin Hu, Zhonghu He, Weihua Jia, Christian C. Abnet, Liming Liang, Nan Hu, Xiaoping Miao, Yifeng Zhou, Zhihua Liu, Qimin Zhan, Yu Liu, Yan Qiao, Yuling Zhou, Guangfu Jin, Chuanhai Guo, Changdong Lu, Haijun Yang, Jianhua Fu, Dianke Yu, Neal D. Freedman, Ti Ding, Wen Tan, Alisa M. Goldstein, Tangchun Wu, Hongbing Shen, Yang Ke, Yixin Zeng, Stephen J. Chanock, Philip R. Taylor, and Dongxin Lin. “Genome-wide association analyses of esophageal squamous cell carcinoma in Chinese identify multiple susceptibility loci and gene-environment interactions.” Nat Genet, 44(10):1090–7, 10 2012.

[WPB03] Tim Wiltshire, Mathew T. Pletcher, Serge Batalov, S. Whitney Barnes, Lisa M. Tarantino, Michael P. Cooke, Hua Wu, Kevin Smylie, Andrey Santrosyan, Neal G. Copeland, Nancy A. Jenkins, Francis Kalush, Richard J. Mural, Richard J. Glynne, Steve A. Kay, Mark D. Adams, and Colin F. Fletcher. “Genome-wide single-nucleotide polymorphism analysis defines haplotype patterns in mouse.” Proc Natl Acad Sci U S A, 100(6):3380–5, 3 2003.

[Wri21] S. Wright. “Correlation and Causation.” Journal of Agricultural Research, 20:557–585, 1921.

[WSL09] Cristen J. Willer, Elizabeth K. Speliotes, Ruth J. F. Loos, Shengxu Li, Cecilia M. Lindgren, Iris M. Heid, Sonja I. Berndt, Amanda L. Elliott, Anne U. Jackson, Claudia Lamina, Guillaume Lettre, Noha Lim, Helen N. Lyon, Steven A. McCarroll, Konstantinos Papadakis, Lu Qi, Joshua C. Randall, Rosa Maria Roccasecca, Serena Sanna, Paul Scheet, Michael N. Weedon, Eleanor Wheeler, Jing Hua Zhao, Leonie C. Jacobs, Inga Prokopenko, Nicole Soranzo, Toshiko Tanaka, Nicholas J. Timpson, Peter Almgren, Amanda Bennett, Richard N. Bergman, Sheila A. Bingham, Lori L. Bonnycastle, Morris Brown, Noël P. Burtt, Peter Chines, Lachlan Coin, Francis S. Collins, John M. Connell, Cyrus Cooper, George Davey Smith, Elaine M. Dennison, Parimal Deodhar, Paul Elliott, Michael R. Erdos, Karol Estrada, David M. Evans, Lauren Gianniny, Christian Gieger, Christopher J. Gillson, Candace Guiducci, Rachel Hackett, David Hadley, Alistair S. Hall, Aki S. Havulinna, Johannes Hebebrand, Albert Hofman, Bo Isomaa, Kevin B. Jacobs, Toby Johnson, Pekka Jousilahti, Zorica Jovanovic, Kay-Tee T. Khaw, Peter Kraft, Mikko Kuokkanen, Johanna Kuusisto, Jaana Laitinen, Edward G. Lakatta, Jian’an Luan, Robert N. Luben, Massimo Mangino, Wendy L. McArdle, Thomas Meitinger, Antonella Mulas, Patricia B. Munroe, Narisu Narisu, Andrew R. Ness, Kate Northstone, Stephen O’Rahilly, Carolin Purmann, Matthew G. Rees, Martin Ridderstrale, Susan M. Ring, Fernando Rivadeneira, Aimo Ruokonen, Manjinder S. Sandhu, Jouko Saramies, Laura J. Scott, Angelo Scuteri, Kaisa Silander, Matthew A. Sims, Kijoung Song, Jonathan Stephens, Suzanne Stevens, Heather M. Stringham, Y. C. Loraine Tung, Timo T. Valle, Cornelia M. Van Duijn, Karani S. Vimaleswaran, Peter Vollenweider, Gerard Waeber, Chris Wallace, Richard M. Watanabe, Dawn M. Waterworth, Nicholas Watkins, Wellcome Trust Case Control Consortium, Jacqueline C. M. Witteman, Eleftheria Zeggini, Guangju Zhai, M. Carola Zillikens, David Altshuler, Mark J. Caulfield, Stephen J. Chanock, I. Sadaf Farooqi, Luigi Ferrucci, Jack M. Guralnik, Andrew T. Hattersley, Frank B. Hu, Marjo-Riitta R. Jarvelin, Markku Laakso, Vincent Mooser, Ken K. Ong, Willem H. Ouwehand, Veikko Salomaa, Nilesh J. Samani, Timothy D. Spector, Tiinamaija Tuomi, Jaakko Tuomilehto, Manuela Uda, André G. Uitterlinden, Nicholas J. Wareham, Panagiotis Deloukas, Timothy M. Frayling, Leif C. Groop, Richard B. Hayes, David J. Hunter, Karen L. Mohlke, Leena Peltonen, David Schlessinger, David P. Strachan, H-Erich E. Wichmann, Mark I. McCarthy, Michael Boehnke, Inês Barroso, Gonçalo R. Abecasis, Joel N. Hirschhorn, and Genetic Investigation of ANthropometric Traits Consortium. “Six new loci associated with body mass index highlight a neuronal influence on body weight regulation.” Nat Genet, 41(1):25–34, 1 2009.

[WWM12] Sheng Wei, Li-E E. Wang, Michelle K. McHugh, Younghun Han, Momiao Xiong, Christopher I. Amos, Margaret R. Spitz, and Qingyi Wei Wei. “Genome-wide gene-environment interaction analysis for asbestos exposure in lung cancer susceptibility.” Carcinogenesis, 33(8):1531–7, 8 2012.

[WYS06] S. Wang, N. Yehya, E.E. Schadt, H. Wang, T.A. Drake, and A.J. Lusis. “Genetic and genomic analysis of a fat mass trait with complex inheritance reveals marked sex specificity.” PLoS genetics, 2(2):e15, 2006.

[WZL04] Therese Wiedmer, Ji Zhao, Lilin Li, Quansheng Zhou, Andrea Hevener, Jerrold M. Olefsky, Linda K. Curtiss, and Peter J. Sims. “Adiposity, dyslipidemia, and insulin resistance in mice with targeted deletion of phospholipid scramblase 3 (PLSCR3).” Proc Natl Acad Sci U S A, 101(36):13296–301, 9 2004.

[YBM10] Jian Yang, Beben Benyamin, Brian P. McEvoy, Scott Gordon, Anjali K. Henders, Dale R. Nyholt, Pamela A. Madden, Andrew C. Heath, Nicholas G. Martin, Grant W. Montgomery, Michael E. Goddard, and Peter M. Visscher. “Common SNPs explain a large proportion of the heritability for human height.” Nat Genet, 42(7):565–9, 7 2010.

[YBW03] G. Yvert, R.B. Brem, J. Whittle, J.M. Akey, E. Foss, E.N. Smith, R. Mackelprang, and L. Kruglyak. “Trans-acting regulatory variation in Saccharomyces cerevisiae and the role of transcription factors.” Nature Genetics, 35:57–64, 2003.

[YNB10] Binnaz Yalcin, Jerome Nicod, Amarjit Bhomra, Stuart Davidson, James Cleak, Laurent Farinelli, Magne Osterras, Adam Whitley, Wei Yuan, Xiangchao Gan, Martin Goodson, Paul Klenerman, Ansu Satpathy, Diane Mathis, Christophe Benoist, David J. Adams, Richard Mott, and Jonathan Flint. “Commercially available outbred mice for genome-wide association studies.” PLoS Genet, 6(9), 9 2010.

[YPB06] Jianming Yu, Gael Pressoir, William H. Briggs, Irie Vroh Bi, Masanori Yamasaki, John F. Doebley, Michael D. McMullen, Brandon S. Gaut, Dahlia M. Nielsen, James B. Holland, Stephen Kresovich, and Edward S. Buckler. “A unified mixed-model method for association mapping that accounts for multiple levels of relatedness.” Nat Genet, 38(2):203–8, 2 2006.

[YWF04] Binnaz Yalcin, Saffron A. G. Willis-Owen, Jan Fullerton, Anjela Meesaq, Robert M. Deacon, J. Nicholas P. Rawlins, Richard R. Copley, Andrew P. Morris, Jonathan Flint, and Richard Mott. “Genetic dissection of a behavioral quantitative trait locus shows that Rgs2 modulates anxiety in mice.” Nat Genet, 36(11):1197–202, 11 2004.

[ZDY09] Qing Yin Zheng, Dalian Ding, Heping Yu, Richard J. Salvi, and Kenneth R. Johnson. “A locus on distal chromosome 10 (ahl4) affecting age-related hearing loss in A/J mice.” Neurobiol Aging, 30(10):1693–705, 10 2009.

[ZE10] N. Zaitlen and E. Eskin. “Imputation aware meta-analysis of genome-wide association studies.” Genetic Epidemiology, 2010.

[ZSS08] Eleftheria Zeggini, Laura J. Scott, Richa Saxena, Benjamin F. Voight, Jonathan L. Marchini, Tianle Hu, Paul I. W. de Bakker, Gonçalo R. Abecasis, Peter Almgren, Gitte Andersen, Kristin Ardlie, Kristina Bengtsson Boström, Richard N. Bergman, Lori L. Bonnycastle, Knut Borch-Johnsen, Noël P. Burtt, Hong Chen, Peter S. Chines, Mark J. Daly, Parimal Deodhar, Chia-Jen J. Ding, Alex S. F. Doney, William L. Duren, Katherine S. Elliott, Michael R. Erdos, Timothy M. Frayling, Rachel M. Freathy, Lauren Gianniny, Harald Grallert, Niels Grarup, Christopher J. Groves, Candace Guiducci, Torben Hansen, Christian Herder, Graham A. Hitman, Thomas E. Hughes, Bo Isomaa, Anne U. Jackson, Torben Jorgensen, Augustine Kong, Kari Kubalanza, Finny G. Kuruvilla, Johanna Kuusisto, Claudia Langenberg, Hana Lango, Torsten Lauritzen, Yun Li, Cecilia M. Lindgren, Valeriya Lyssenko, Amanda F. Marvelle, Christa Meisinger, Kristian Midthjell, Karen L. Mohlke, Mario A. Morken, Andrew D. Morris, Narisu Narisu, Peter Nilsson, Katharine R. Owen, Colin N. A. Palmer, Felicity Payne, John R. B. Perry, Elin Pettersen, Carl Platou, Inga Prokopenko, Lu Qi, Li Qin, Nigel W. Rayner, Matthew Rees, Jeffrey J. Roix, Anelli Sandbaek, Beverley Shields, Marketa Sjögren, Valgerdur Steinthorsdottir, Heather M. Stringham, Amy J. Swift, Gudmar Thorleifsson, Unnur Thorsteinsdottir, Nicholas J. Timpson, Tiinamaija Tuomi, Jaakko Tuomilehto, Mark Walker, Richard M. Watanabe, Michael N. Weedon, Cristen J. Willer, Wellcome Trust Case Control Consortium, Thomas Illig, Kristian Hveem, Frank B. Hu, Markku Laakso, Kari Stefansson, Oluf Pedersen, Nicholas J. Wareham, Inês Barroso, Andrew T. Hattersley, Francis S. Collins, Leif Groop, Mark I. McCarthy, Michael Boehnke, and David Altshuler. “Meta-analysis of genome-wide association data and large-scale replication identifies additional susceptibility loci for type 2 diabetes.” Nat Genet, 40(5):638–45, 5 2008.

[ZZS08] J. Zhu, B. Zhang, E.N. Smith, B. Drees, R.B. Brem, L. Kruglyak, R.E. Bumgarner, and E.E. Schadt. “Integrating large-scale functional genomic data to dissect the complexity of yeast regulatory networks.” Nature Genetics, 40(7):854, 2008.
