Analysis of Genomic Variants for Investigating the Genetic Etiology of Disease

A DISSERTATION SUBMITTED TO THE DEPARTMENT OF BIOMEDICAL INFORMATICS AND THE COMMITTEE ON GRADUATE STUDIES OF STANFORD UNIVERSITY IN PARTIAL FULFILLMENT OF THE REQUIREMENTS FOR THE DEGREE OF DOCTOR OF PHILOSOPHY

Daniel Edmund Newburger
March 2015

© 2015 by Daniel Edmund Newburger. All Rights Reserved. Re-distributed by Stanford University under license with the author.

This work is licensed under a Creative Commons Attribution-Noncommercial 3.0 United States License. http://creativecommons.org/licenses/by-nc/3.0/us/

This dissertation is online at: http://purl.stanford.edu/kh271wr8164

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Serafim Batzoglou, Primary Adviser

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Jonathan Pritchard

I certify that I have read this dissertation and that, in my opinion, it is fully adequate in scope and quality as a dissertation for the degree of Doctor of Philosophy.
Arend Sidow

Approved for the Stanford University Committee on Graduate Studies.
Patricia J. Gumport, Vice Provost for Graduate Education

This signature page was generated electronically upon submission of this dissertation in electronic format. An original signed hard copy of the signature page is on file in University Archives.

Abstract

The study of genomic variation within human populations is critical for elucidating the genetic factors that contribute to disease. Identifying and characterizing the genetic architecture of disease advances clinical care by facilitating the development of novel diagnostic tools, the identification of new therapeutic targets, and the practice of personalized treatment for genetic syndromes.
The massive volume of genetic data generated by modern genotyping technologies and the informatics challenges of filtering and interpreting these noisy measurements represent significant obstacles to genomic research. These technical issues necessitate the development of computationally efficient methodologies that leverage raw genotype data for the comparative genomic analysis of complex phenotypes across human subpopulations.

In this dissertation, I describe my contributions towards the biomedical study of genetic syndromes using high-throughput genotyping technologies. First, I discuss methods for studying the genome evolution of pre-malignant cancer lesions during progression to breast cancer. Second, I describe algorithms for performing highly accurate variant validation in genomic studies using next generation sequencing. Finally, I present methods for identifying novel disease susceptibility loci in complex diseases using identity by descent mapping in large case-control cohorts.

Acknowledgements

I would like to thank the truly extraordinary mentors, collaborators, and friends who have supported me through both the good times and the terrifying doldrums of graduate school. I simply cannot thank you enough for your patience, wisdom, and friendship.

Foremost, I would like to thank my thesis advisor, Serafim. I remain in awe of your ability to deconvolute the most tangled analytical problems into solvable components. You found elegant paths through so many technical obstacles in my research and have always been a wellspring of novel ideas. I am even more grateful to you for your unwavering encouragement and patience. You gave me the freedom to explore far afield, and I feel privileged to be part of your group.

I am deeply grateful to Arend Sidow for his guidance, mentorship, and leadership. Arend, you brought vision and scientific rigor to every meeting, and you always managed to make time in your schedule to share your expertise.
You taught me how to examine complex problems down to the finest detail, and your forthright advice and criticisms have been invaluable. You are one of the few people who will say what you really think, and yet you are always optimistic and generous in your feedback.

I am also indebted to the other members of my reading and orals committees: Rob West, Jonathan Pritchard, and Gavin Sherlock. Rob, your boundless knowledge of cancer genetics and histomorphology drove our cancer genomics projects forward, and I am grateful for all of the time you spent tutoring me in the field. Jonathan, although we didn’t meet until the end of my graduate career, your advice has been penetrating and insightful. Gavin, thank you so much for chairing my defense committee and for your keen questions and suggestions.

My thesis would not have been possible without several other mentors. I would like to thank Hanlee Ji and Sivan Bercovici for their incredible generosity. Hanlee, you coached me through the first years of my PhD with wisdom, precision, and humor. Your relentless pursuit of scientific innovation and your mastery of genetics, oncology, and biotechnology continue to inspire me. Sivan, thanks are entirely inadequate to express my gratitude for your patience, your guidance, and the surfeit of brilliant ideas you contributed to our joint projects. Our meetings have been some of the funniest and most productive moments in my graduate work. I would also like to thank Atul Butte, who first introduced me to bioinformatics as an undergrad, and whose encouragement and counsel propelled me through the first few years of graduate school.

I am deeply indebted to my academic advisor, Russ Altman. Russ, your clairvoyant advice during our biannual meetings proved pivotal over and over again, and I can’t thank you enough for ensuring that my meandering thesis evolved into a BMI dissertation.
Likewise, I am deeply indebted to my colleague Alex Morgan, who has been exceptionally generous as a mentor. Whether proofreading my fellowship applications in first year or talking me through tough decisions in sixth year, you have always provided singularly thoughtful advice and gone far out of your way to render assistance. Without your help, I would still be floundering in my studies.

It has been a joy to be a member of the BMI program. Mary Jeanne, thank you so much for steering me through the tortuous process of navigating graduate school. We in BMI are incredibly lucky to have you at the helm of the BMI program, keeping us from running aground on rocky shores. I would like to thank all the other amazing people who have kept BMI afloat: Nancy Lennartsson, Steve Bagley, John DiMario, Betty Cheng, Larry Fagan, Carol Maxwell, and of course Darlene Vian. I would also like to thank my staunch compatriots in BMI, especially fellow classmates Linda Liu, Nick Tatonetti, and Rob Bruggner.

The Batzoglou lab has fostered some of the most amazing folks at Stanford, and I feel incredibly fortunate to call them friends and colleagues. I would especially like to thank Sarah Aerni, Marc Schaub, Tom Do, Sam Gross, Jesse Rodriguez, Sofia Kyriazopoulou-Panagiotopoulou, Anshul Kundaje, Lin Huang, Alex Bishara, and Yuling Liu. Marc and Sarah, your friendship and advice meant so much to me as I struggled to orient myself in the lab, and I deeply appreciate your generosity as mentors. Jesse, working with you and learning from you has been a blast. Alex, I still have not watched Clerks.

I have been privileged to work with incredible collaborators from outside the lab, as well. I would like to thank Georges Natsoulis, John Bell, Sue Grimes, Patrick Flaherty, Sarah Garcia, Ziming Weng, Noah Spies, Alayne Brunner, Robert Sweeney, and Marina Sirota. Patrick, you inspired me with your commitment to scientific excellence and taught me how to evaluate my projects and research goals.
John, I greatly enjoyed kvetching and swapping books during our much-needed coffee breaks.

I would like to give special thanks to a few friends without whom I would never have completed my graduate studies. I am incredibly grateful to Dorna Kashef-Haghighi, whose brilliance and hard work made our joint projects in the Batzoglou lab possible, and whose friendship made it fun. Working alongside you was the highlight of graduate school. Tiffany Chen and Tim Lee, I am exceptionally fortunate to be friends with you. You have been my most trusted confidants in matters ranging from research priorities to hunting for good eats in Cupertino. Tiffany, your insight and wisdom regarding matters of both research and career have been invaluable. Tim, your humor, consideration, and scientific advice have kept me sane during times of stress and failure. I hope you get another twelve-win arena run soon.

My family has been an inexhaustible source of love and support. Mom, you have always set the highest bar for hard work and dedication to research. I would never have made it through graduate school without your encouragement and, when necessary, admonishments. Dad, you first got me interested in science, and your sage and pragmatic advice has always helped me tackle questions of research and career. Maggie, I can always look to you for both encouragement and commiseration.

Finally, completing graduate school would have been inconceivable without the love and support of my wife, Melody. Mel, whether proofreading my papers at midnight, fixing my slides, celebrating victories, or providing consolation, you were always there for me, making life better than better; you make life great.

Contents

Abstract
Acknowledgements
1 Introduction
2 Background
  2.1 The genome and disease
    2.1.1 Genomic variation
    2.1.2 Technologies for genomic studies
    2.1.3 Cancer sequencing
3 Genome Evolution in Breast Cancer
  3.1 Abstract
  3.2 Introduction
  3.3 Results
    3.3.1 Whole-genome sequencing of early neoplasias and related carcinomas from archival material
    3.3.2 Somatic SNVs fall into a limited and highly structured set of classes