LEVERAGING LARGE-SCALE DATASETS TO UNDERSTAND THE INTERACTION BETWEEN THE GENOME AND THE EPIGENOME

by Leandros Boukas

A dissertation submitted to Johns Hopkins University in conformity with the requirements for the degree of Doctor of Philosophy.

Baltimore, Maryland
June, 2020

© 2020 Leandros Boukas
All rights reserved

Abstract

Epigenetics is typically described as a layer of molecular information above and beyond the DNA sequence. While this conceptualization is certainly accurate to some extent, there is also a tight connection between the genome and the epigenome, as the basic components of the epigenetic machinery (EM) are DNA-encoded. This thesis focuses on four such genetic components: genes encoding the proteins of the histone machinery, genes encoding the proteins of the DNA methylation machinery, genes encoding chromatin remodelers, and CpG dinucleotides. We first perform a systematic analysis of all human EM genes, and characterize them with respect to their tolerance to variation, both at the whole-gene level and at the local, protein-domain level. We then discover a systems-level property (co-expression) that is specifically exhibited by a large subset of variation-intolerant EM genes, and may be particularly relevant to their involvement in neurodevelopment. Finally, we shift our focus to the CpG dinucleotides. We show that a high promoter CpG density is not merely a generic feature of human promoters, but is preferentially encountered at the promoters of the most loss-of-function-intolerant genes. This coupling calls into question the prevailing view that CpG islands are not subject to selection. It also has practical utility, as it allows us to train a simple and easily interpretable predictive model of loss-of-function intolerance that outperforms existing predictors and classifies 1,760 genes, which are currently unascertained, as highly loss-of-function-intolerant or not.
Together, the results presented in this thesis provide new insights into the interaction between the genome and the epigenome.

Advisors and Readers: Kasper Daniel Hansen, PhD, and Hans Tomas Bjornsson, MD, PhD

Acknowledgments

I am indebted to many individuals for their contribution to my work during the past 5 years.

First and foremost, I am deeply grateful to my advisors, Kasper and Hans. While having two advisors is not always guaranteed to succeed, in my case it was a true blessing and privilege.

In particular, I would never have been accepted into the Human Genetics PhD program at Johns Hopkins if Hans hadn’t generously offered me the opportunity to work in his lab as a visiting student in the summer of 2014. Since then, I have watched from the front seat how he asks bold and important scientific questions, how he attacks them from many possible angles, and how he has developed a groundbreaking research program aimed at finding cures for his patients, while always keeping an eye on the basic science. For a budding physician-scientist like me, Hans is an ideal role model.

Aside from allowing me to get into this PhD program, Hans did another crucial thing for my training: he suggested that I work with Kasper. Following this suggestion turned out to be one of the best decisions I have ever made. Data analysis is both a science and an art that I believe is impossible to truly learn unless one works closely, and for an extended period of time, with a true expert. I have seen how Kasper uses his solid foundation in theoretical statistics, and an understanding of biology that, over the years, has become as deep as that of biologists, to approach meaningful problems in a manner that is both rigorous and intuitive. Whenever I walked into his office confused about some analysis, I always, without exception, walked out with things being much clearer in my head.
I have also learned quite a lot from our many discussions over lunch, which I have very much enjoyed.

Finally, I greatly appreciate the total freedom Kasper and Hans have given me to explore my own ideas, even when they are not directly related to their other research projects. If I manage to be half as good an advisor as they have been to me, my future students will be extremely lucky.

I also have to give special thanks to Dr. Valle and Sandy. I am very proud to call myself a product of the Human Genetics program, and it’s clear that their deep commitment is one of the driving forces behind it. They have created an environment where it is possible for us students to live and breathe genetics. The many departmental activities and courses have certainly helped me acquire breadth that I would not have acquired had I only been focused on my own research.

Thanks to my thesis committee members, Dr. Valle, Dani Fallin, and Alexis Battle. I especially want to thank Alexis, who provided us with constructive feedback for the co-expression analysis. I also thank Jill Fahrner, with whom we have joint lab meetings, and who has often provided me with very useful feedback. Thanks also to Loyal Goff and Dimitrios Avramopoulos, with whom I spent two very pleasant rotations in my first year. In addition, I have had some very stimulating interactions and discussions with Kirby Smith, Barbara Migeon, Haig Kazazian, and Stephanie Hicks.

Sincere thanks to Priya Duggal and Jennifer Deal, who accepted me into the MD-GEM program. The courses I took as part of it definitely helped me understand how to think about population-scale genetics.

Of course, I also want to thank my classmates. It has been great to have their friendship and support, and to be able to share this journey with them.

I have also been very fortunate to have friends outside Hopkins, with whom I have made some great memories. Thanks to Thanos, Kleio, Chris, Maria, Greta, and Antonis.
Very special thanks have to go to Ilias and Mike (who is my best friend from med school and now also my roommate); I hope the three of us will continue to share this journey through residency training.

I am grateful to the Jenkins family: Stella, Larry, Christian, and Daniel. Ever since I came to Baltimore they have adopted me as their third child (which, of course, they didn’t have to do), and it has been great to know that I have a family here, even though my real parents are far away.

I don’t need to say much about my best friends from Greece: Thomas, Janis, Kostas, Dimitris, and Iordanis. Since we graduated from college we have scattered around the globe, but our friendship (brotherhood, indeed) has no borders.

The final year of my PhD was by far the happiest for me. There is only one reason for this, which has nothing to do with the science. It is that I have been able to share my life with Giota, and I look forward to many more years with her.

Last, but not least, I wouldn’t be standing here without the endless and unconditional love and support of my parents, Effie and Andreas, the latter of whom is also responsible for inspiring my love for science. I wholeheartedly dedicate this thesis to them.

To my mom and dad

Contents

Abstract
Acknowledgments
List of Tables
List of Figures
1 Introduction
2 Co-expression patterns define epigenetic regulators associated with neurological dysfunction
2.1 Preface
2.2 Introduction
2.3 Results
2.3.1 The modular composition of the epigenetic machinery
2.3.2 The human epigenetic machinery is highly intolerant to variation and contains many additional disease candidates
2.3.3 Dual function epigenetic regulators and remodelers are the most variation-intolerant categories
2.3.4 The intolerance to variation is primarily driven by the domains mediating the epigenetic function
2.3.5 A large subset of the epigenetic machinery is co-expressed
2.3.6 Dual function epigenetic regulators are enriched in the highly co-expressed group and are co-expressed with multiple other categories
2.3.7 The highly co-expressed epigenetic regulators are extremely intolerant to variation and enriched for genes causing neurological dysfunction
2.3.8 Brain-specific regulatory elements of highly co-expressed epigenetic regulators are enriched for SNPs that explain the heritability of common neurological traits
2.3.9 The promoters of highly co-expressed genes of the epigenetic machinery are bound by common trans-acting factors
2.4 Discussion
2.5 Methods
2.5.1 The creation of an epigenetic regulator list
2.5.2 Epigenetic regulators with disease associations
2.5.3 Variation tolerance analysis
2.5.4 CCR local constraint score
2.5.5 GTEx data
2.5.6 Tissue specificity and expression level analysis
2.5.7 Co-expression analysis
2.5.8 Trans-acting factor binding at EM gene promoters
2.5.9 Enrichment of disease genes in the highly co-expressed group
2.5.10 Stratified LD score regression
2.5.11 Genome assembly version
2.5.12 Code availability
2.5.13 Acknowledgments
2.6 Supplemental Materials
2.6.1 Supplemental Results
2.6.1.1 Variation intolerance of EM genes encoded on the sex chromosomes
2.6.1.2 Tissue specificity and expression levels of EM genes
2.6.1.3 The highly co-expressed genes are not enriched for protein-protein interactions