Integrative Methods for the Analysis of Genome Wide Association Studies

INTEGRATIVE METHODS FOR THE ANALYSIS OF GENOME WIDE ASSOCIATION STUDIES

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF COMPUTER SCIENCE AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Marc A. Schaub
June 2012

© 2012 by Marc Andreas Schaub. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/qt820xd3631

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. Atul Butte

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy. David Dill

Approved for the Stanford University Committee on Graduate Studies. Patricia J. Gumport, Vice Provost Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

Genome Wide Association Studies (GWAS) have identified over 4,500 common variants in the human genome that are statistically associated with diseases and other phenotypical traits. Most identified associations, however, only have a small effect on disease risk, and their relevance in a clinical setting remains the subject of extensive debate.
In this thesis I present three integrative analysis directions that extend GWAS by developing new methods, by using genotyping data to ask new questions, and by integrating additional types of data to generate functional hypotheses about the biological processes underlying associations. First, I introduce a new classifier-based methodology that identifies similarities in the genetic architecture of diseases. This method can successfully identify both known and novel relationships between common diseases such as type 1 diabetes, rheumatoid arthritis, hypertension and bipolar disease. Second, I show how control individuals from a GWAS can be used to detect genetic differences between the pseudoautosomal regions of chromosomes X and Y in the general population, which can be attributed to differences in allele frequency between the two sex chromosomes likely caused by selective pressure. Finally, I present an approach that integrates experimental data generated by the ENCODE consortium in order to identify functional Single Nucleotide Polymorphisms (SNPs). These functional SNPs are associated with a phenotype, either directly or through linkage disequilibrium, and overlap a functional part of the genome such as a transcribed region or a transcription factor binding site. GWAS associations are significantly enriched for functional annotations, and up to 80% of all associations previously reported in a GWAS can be mapped to a functional SNP. For most associations the functional SNP most strongly supported by experimental evidence is a SNP in linkage disequilibrium with the reported association rather than the reported SNP itself.

Acknowledgment

I would like to thank my advisor Serafim Batzoglou for his advice, feedback, encouragement and support throughout my graduate career, for giving me the freedom to explore a broad range of research directions, and for bringing together such a truly outstanding group of researchers.
In particular, I would like to thank George Asimenos and Chuong Do for all their highly valuable advice early in my graduate career, Anshul Kundaje for his advice and support during the second half of my thesis work, and Irene Kaplow for having been such a fantastic summer student.

I would like to thank Atul Butte for all the feedback and encouragement throughout my thesis work, for inviting me to his group meetings, for serving on my qualifying examination, defense and reading committees, and for giving me the privilege of collaborating closely with two amazing Ph.D. students in his group, Marina Sirota and Linda Liu. Working with Marina and Linda has certainly been the favorite part of my research work at Stanford, and I'm very grateful for everything they have done to make these joint projects such a successful and tremendously enriching experience.

I would like to thank Michael Snyder for giving me the opportunity to collaborate with his group on the analysis of the ENCODE data, for his advice and support, and for serving on my defense committee; Ross Hardison for his support of my work on linking ENCODE and GWAS data; David Dill for serving on my qualifying exam, defense and reading committees; and Arend Sidow for chairing my thesis defense.
I would like to thank my friends and colleagues in the Batzoglou, Butte and Snyder labs and in the Stanford biomedical research community for their support, encouragement, advice, feedback, and the many fruitful discussions we had about research and life in general: Sarah Aerni, Andy Beck, Sivan Bercovici, Alan Boyle, Leticia Britos, David Chen, Rong Chen, Tiffany Chen, Annie Chiang, Erik Corona, Michelle Davison, Eugene Davydov, Omkar Deshpande, Joel Dudley, Robert Edgar, Megan Elmore, Sangeeta English, Patrick Flaherty, Jason Flannick, Chuan Sheng Foo, Eugene Fratkin, Yael Garten, Andrew Gentles, Sam Gross, Adam Grossman, Philip Guo, Manoj Hariharan, Lin Huang, Nadine Hussami, Robert Ikeda, Konrad Karczewski, Dorna Kashef-Haghighi, Peter Kang, Purvesh Khatri, Keiichi Kodama, Andy Kogelnik, Sofia Kyriazopoulou-Panagiotopoulou, Wei-Nchih Lee, Daniel Li, Li Li, Max Libbrecht, Irene Liu, Yuling Liu, Alex Morgan, Daniel Newburger, Tony Novak, Jon Palma, Chirag Patel, Victoria Popic, Yannick Pouliot, Dmitry Pushkarev, Jesse Rodriguez, Jon Rodriguez, David Ruau, Olga Russakovsky, Karen Sachs, Raheleh Salari, Nicelio Sanchez-Luege, Shai Shen-Orr, Andreas Sundquist, Silpa Suthram, Nick Tatonetti, Rob Tirrell, Shivkumar Venkatasubrahmanyam, Dan Webster and Noah Zimmerman.

My research work would not have been possible without the outstanding technical and administrative support of Miles Davis, Kathi DiTommaso, Sebastian Gutierrez, Alex Sandra Pinedo, Alex Skrenchuk, Tanya Raschke, Liliana Rivera and Verna Wong.

During my time at Stanford, I had the privilege of being involved in a broad range of extracurricular activities.
I would like to thank all my friends in Stanford EMS, and in particular Florian Schmitzberger, Chris Cheung, Brian Cheung, Glenn Ulansey, Lauren Mamer, Mark Liao and James Liao, the teaching staff of the Stanford EMT program, the Stanford Wilderness Medicine instructor team, and the Escondido Village Community Associates for their friendship, encouragement and support throughout my graduate career, and for just being an amazing group of people! These programs tremendously enriched my experience at Stanford, and would not have been possible without the support of the Department of Public Safety, the Division of Emergency Medicine and Stanford Outdoor Education. I would like to thank the Graduate Life Office, and in particular Ken Hsu, Laurette Beeson and Anne Boswell for their support of my work as a Community Associate in Studio 2, and all the great work they do to assist the Stanford graduate student community in general.

While many miles away, my friends and family in Switzerland, and in particular Frédéric Evéquoz, Grégory Mermoud and Grégory Théoduloz as well as my brother Alain, have always been very supportive of my work. Finally, none of this would have been possible without the unwavering support of my family throughout my entire career. I am deeply grateful to my father Andreas and my mother Margrith for everything they have done in order to give me the opportunity to follow my interests, and for always encouraging me to do so, even when it meant living nine time zones away from home. Danke viel, vielmals für alles!

Joint Work

Chapter 3 and Sections 2.1 and 2.2 of Chapter 2 are a reproduction, in part, of a previously published article:

M.A. Schaub, I.M. Kaplow, M. Sirota, C.B. Do, A.J. Butte, S. Batzoglou. A Classifier-based Approach to Identify Genetic Similarities Between Diseases. Bioinformatics 25: i21-29. 2009.

I would like to thank my co-authors Irene M. Kaplow, Marina Sirota, Chuong B. Do, Atul J.
Butte and Serafim Batzoglou for their contributions to this project. I conceived and designed the study, performed all data preprocessing, implemented the version of the decision tree classifier used to obtain the reported results, analyzed the data and wrote the manuscript. Irene M. Kaplow performed exploratory research comparing various classifiers, which led to the choice of the decision tree classifier we used. Marina Sirota revised Figure 3.1, and designed the version shown herein. Chuong B. Do and Marina Sirota provided input and feedback on the study design and data analysis. Atul J. Butte and Serafim Batzoglou helped conceive the study and supervised it. All authors revised the manuscript.

Chapter 4 and Section 2.3 of Chapter 2 represent joint work that will become part of a manuscript to be submitted after the time of submission of this thesis. I would like to thank my co-authors on this upcoming manuscript, Linda Y. Liu, Marina Sirota, Serafim Batzoglou and Atul J. Butte, for their contributions to this project. Linda Y. Liu and I jointly conceived and designed the study. Linda Y. Liu performed the analysis on the WTCCC data set. I performed the analysis on the HapMap 3 data set, developed the modified Hardy-Weinberg model, identified the sequence homology issue leading to false positives in autosomes, and wrote the chapter.
