Methods for Genome Interpretation: Causal Gene Discovery and Personal Phenotype Prediction

METHODS FOR GENOME INTERPRETATION: CAUSAL GENE DISCOVERY AND PERSONAL PHENOTYPE PREDICTION by Yun-Ching Chen A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy Baltimore, Maryland September, 2014 © Yun-Ching Chen 2014 All rights reserved Abstract Genome interpretation – illustrating how genomic variation affects phenotypic variation – is one of the central questions of the early 21st century. Deciphering the mapping between genotypes and phenotypes requires the collection of a large amount of data, both genetic and phenotypic. Phenotypic profiles, for example, have been systematically recorded and archived in hospitals and national health-related organizations for years. Human genome sequences, however, had not been sequenced in a high throughput manner until next- generation sequencing technologies became available in 2005. Since then, vast amounts of genotype-phenotype data have been collected, allowing for the unprecedented opportunity for genome interpretation. Genome interpretation is an ambitious, poorly understood goal that may require collaboration between many disciplines. In this dissertation, I focus on the development of computational methods for genome interpretation. Based on recent interest in relating genotypes and phenotypes, the task is divided into two stages: discovery (Chapters 2-6) and prediction (Chapters 7-10). In the discovery stage, the location of genomic loci associated with a phenotype of interest is identified based on sequence-based case-control studies. In the prediction stage, I propose a probabilistic model to predict personal phenotypes given an individual’s genome by integrating many sources of information, including the phenotype- associated loci found in the discovery stage. Advisor: Dr. Rachel Karchin Reader: Dr. Joel Bader ii Acknowledgement I would like to acknowledge the people who have helped me during my graduate studies. First, I would like to thank my PhD advisor Dr. Rachel Karchin. I feel grateful for her encouragement and full support of my ideas for my graduate studies. She works very hard bringing together collaborations and connections that help her students in the lab. I admire her enthusiasm for science and the constant effort she puts forth towards research. From her, I learned a lot about how to be a great scientist and pursue impactful and rigorous research. She not only gave me insightful advice regarding my research but also shared her personal experiences with me as guidance for my future directions. I truly feel lucky that I had the opportunity to work for my PhD with her support. I would like to thank my thesis committee, which consisted of Dr. Donald Geman, Dr. Joel Bader and Dr. Dan Arking. I thank them for their generosity and willingness to take time from their busy schedules for my thesis meeting and defense. They all gave me many great ideas and advice to improve my work. I would like to thank my collaborators in the Bipolar disorder project at Hopkins: Dr. James Potash, Dr. Peter Zandi, Dr. Mehdi Pirooznia and Dr. Fernando Goes. They closely collaborated with us to help me on my first project – the development of a causal gene discovery method. Mehdi helped me with the upstream data analysis and discussed every technical detail with me with endless patience. Peter gave great insights on method development. Jimmy and Fernando offered important input from the clinical perspective. I would like to thank the people who have worked with me in the Karchin lab, including Dr. Hannah Carter, Dr. David Masica, Andy Wong, Noushin Niknafs, Christopher Douville, Violeta Beleva Guthrie, Dewey Kim, Jean Fan, Grace Yeo, Cheng Wang, Xinyuan Wang and iii Dr. Josue Samayoa. Hannah was the first Ph.D. from the Karchin lab and I am the second. She is always motivated and helpful and was a great leader in the lab until she left in October 2012. Hannah and Violeta occasionally held the wine tasting event for the lab, which is an unforgettable memory. Andy and Dewey are great colleagues and friends. We spent a lot of time together after work and shared all news good and bad with each other. Noushin and Chris are smart and diligent. We discussed, shared ideas and solved problems together in the past few years. It was my pleasure to work with them. Grace and Cheng helped me a lot on my second project – personal phenotype prediction. Jean and Xinyuan helped me with coding and software implementation during my first and second year in the Karchin lab. Dave and Josue gave me much great advice based on their Ph.D. research experiences. I would like to thank Institute of Computational Medicine (ICM) staffs, especially Christina Anger and Joanne Hall, the system administrators, Kyle Reynolds, Paul Tatarsky and Dr. Anping Liu, and Biomedical Engineering program PhD manager, Hong Lan. Hong, Joanne and Chris helped me a lot with administration stuff. Anping was the system administrator of Karchin lab in my first year and occasionally joined the lab meeting offering some opinions. Kyle and Paul were the system administrators in the past four years. They helped me solving all kinds of system issues. It is hard to believe my research could have been done without their support. I would like to thank those people who have been my (222) roommates: Alan Huang, Lisa Huang, Xindong Song, Yunke Song, Cun-Ting (Amy) Chao, Dan Jiang, Henan Xu, Ai- Cheng (Ariel) Yung, Wen-Ting (Brook) Sung, Wei-Lun (Angela) Chang, Kailun Wan, Xi Chen, Bryce Chiang, Pauline Che and Lin Kuo. They gave me wonderful memories of my Ph.D. life. We cooked and ate dinner together every day. We climbed to the summit of Mt. Whitney, went sea kayaking in Alaska and had fun in Las Vegas. We celebrated every iv Chinese festival, discussed research ideas, cleaned the house, fixed the washing machine, played board games and experienced laughs and tears together. I really appreciate every aspect of their support. They greatly enriched my life as a PhD student. I would like to thank all my friends at Hopkins, especially Nan Li, Ceci Ng, Lingyun Zhao, Tao Yu, Lingje Sang, Alice Ho, Yu-Ja (Kevin) Huang, Wei-Chang Chen, An-Chi Wei, Jackie Li, Jixin Li, Changji Shi and Muzi Na. Their support and friendship made my 6 years in Baltimore wonderful. I would like to thank my Kong-Fu mentor Liang-Wei Chang in Taiwan. He brought me into the essence of Chinese Kong-Fu, and wisdom from ancient China. Kong-Fu training not only brings me a healthy body but also makes me confident in my heart regardless of the circumstances I confront outside. Last, but not least, I am indebted to my family. Their full support is always the best cure for the frustration I sometimes face in my research. Although they know very little about the details of my research, they are always excited to hear about any progress I have made during the video chats we share once every two weeks. They are always with me. I really appreciate the confidence they place in me, as always. Finally and most importantly, I want to thank my wife, Pai-Ling Lo, for her support over the past 6 years, day and night. v To my family vi Table of contents Chapter 1 Overview .................................................................................................................................. 1 1.1 Causal gene discovery .................................................................................................................... 1 1.2 Personal phenotype prediction .................................................................................................... 2 Chapter 2 Complex disease and sequence-based association studies ............................................ 3 2.1 History of association studies ...................................................................................................... 3 2.1.1 Family-based linkage study ................................................................................................... 3 2.1.2 Population-based association study .................................................................................... 4 2.2 Genome-wide association studies ............................................................................................... 5 2.2.1 Common disease common variant (CDCV) hypothesis ............................................... 5 2.2.2 Genome-wide association study (GWAS) ........................................................................ 6 2.2.3 Missing heritability ................................................................................................................. 7 2.3 Sequence-based association study ............................................................................................... 8 2.3.1 Challenge in statistical analysis ............................................................................................ 8 2.3.2 Related work ......................................................................................................................... 10 Chapter 3 Burden Or Mutation Position (BOMP) ......................................................................... 13 3.1 A hybrid likelihood ratio test .................................................................................................... 13 3.2 Mutation burden statistic ........................................................................................................... 14 3.2.1 Individual burden ................................................................................................................ 14 3.2.2 Individual

Methods for Genome Interpretation: Causal Gene Discovery and Personal Phenotype Prediction

Cytogenomic SNP Microarray - Fetal ARUP Test Code 2002366 Maternal Contamination Study Fetal Spec Fetal Cells

Supplementary Information Integrative Analyses of Splicing in the Aging Brain: Role in Susceptibility to Alzheimer’S Disease

A Computational Approach for Defining a Signature of Β-Cell Golgi Stress in Diabetes Mellitus

Germline Risk of Clonal Haematopoiesis

Hippo and Sonic Hedgehog Signalling Pathway Modulation of Human Urothelial Tissue Homeostasis

DIPPER, a Spatiotemporal Proteomics Atlas of Human Intervertebral Discs

Manuscript Submission Manuscript Draft Manuscript Number: BMB-20-087 Title: Emerging Functions for ANKHD1 in Cance

Gene Networks Activated by Specific Patterns of Action Potentials in Dorsal Root Ganglia Neurons Received: 10 August 2016 Philip R

Single-Cell RNA-Seq Identifies Unique Transcriptional Landscapes Of

Insulin Resistance Intrudes AKT2's Protective Effects Against Neural

Genome-Wide Analysis of DNA Methylation and the Gene Expression Change in Lung Cancer

Rare Functional Variant in TM2D3 Is Associated with Late-Onset Alzheimer's Disease