In Silico Edgetic Profiling and Network Analysis of Human Genetic Variants, with an Application to Disease Module Detection
by Hongzhu Cui

A Dissertation Submitted to the Faculty of the WORCESTER POLYTECHNIC INSTITUTE in partial fulfillment of the requirements for the Degree of Doctor of Philosophy in Bioinformatics and Computational Biology

May 2020

APPROVED:
Professor Dmitry Korkin, Worcester Polytechnic Institute, Advisor
Professor Amity L. Manning, Worcester Polytechnic Institute, Committee Member
Professor Zheyang Wu, Worcester Polytechnic Institute, Committee Member
Professor Manoj Bhasin, Emory University, External Committee Member
Professor Dmitry Korkin, Worcester Polytechnic Institute, Head of Department

Home Rooms – S4E3: “I love the first day, man. Everybody all friendly an’ shit” – Namond Brice
Slapstick – S3E9: “…while you’re waiting for moments that never come.” – Freamon
Backwash – S2E7: “Don’t worry kid. You’re still on the clock.” – Horseface
Collateral Damage – S2E2: “They can chew you up, but they gotta spit you out.” – McNulty
Unto Others – S4E7: “Aw yeah. That golden rule.” – Bunk
The Detail – S1E2: “You cannot lose if you do not play.” – Marla Daniels
Boys of Summer – S4E1: “Lambs to the slaughter here.” – Marcia Donnelly
Hard Cases – S2E4: “If I hear music, I’m gonna dance.” – Greggs
The Pager – S1E5: “…a little slow, a little late.” – Avon Barksdale
Cleaning Up – S1E12: “This is me, yo, right here.” – Wallace
Took – S5E7: “They don’t teach it in law school.” – Pearlman
Time After Time – S3E1: “Don’t matter how many times you get burnt, you just keep doin’ the same.” – Bodie
The Buys – S1E3: “The king stay the king.” – D’Angelo
The Wire – S1E6: “…and all the pieces matter.” – Freamon
Port in a Storm – S2E12: “Business. Always business.” – The Greek
The Dickensian Aspect – S5E6: “If you have a problem with this, I understand completely.” – Freamon
Final Grades – S4E13: “If animal trapped call 410-844-6286” – Baltimore, traditional
Moral Midgetry – S3E8: “Crawl, walk, and then run.” – Clay Davis

Some opening credit quotes from my favorite TV show, The Wire: https://www.youtube.com/watch?v=QqNNpGlUKHw

Abstract

In the past several decades, Next Generation Sequencing (NGS) methods have produced large amounts of genomic data at an exponentially increasing rate. They have also enabled tremendous advances in the quest to understand the molecular mechanisms underlying complex human traits. Along with the development of NGS technology, many genetic variation and genotype–phenotype databases and functional annotation tools have been developed to help scientists better understand the intricacy of the data. Together, these developments bring us one step closer to a mechanistic understanding of complex phenotypes. However, it has rarely been possible to translate such a massive amount of information on mutations and their associations with phenotypes into biological or therapeutic insights, and the mechanisms underlying genotype-phenotype relationships remain only partially explained. Meanwhile, increasing evidence shows that biological networks are essential, albeit not sufficient, for a better understanding of these mechanisms. Among them, protein-protein interaction (PPI) network studies have attracted perhaps the most attention. The overarching goal of this dissertation is to (i) perform a systematic study of the role of pathogenic human genetic variants in the interactome; (ii) examine how common population-specific SNVs affect the PPI network and how they contribute to population phenotypic variance and disease susceptibility; and (iii) develop a novel framework that incorporates the functional effect of mutations for disease module detection.
In this dissertation, we first present a systematic multi-level characterization of human mutations associated with genetic disorders by determining their individual and combined interaction-rewiring effects on the human interactome. Our in-silico analysis highlights the intrinsic differences and important similarities between pathogenic single nucleotide variants (SNVs) and frameshift mutations. Functional profiling of SNVs indicates widespread disruption of protein-protein interactions and synergistic effects of SNVs. The coverage of our approach is several times greater than that of a recently published experimental study and has minimal overlap with it, while the distributions of determined edgotypes between the two sets of profiled mutations are remarkably similar. Case studies reveal the central role of interaction-disrupting mutations in type 2 diabetes mellitus and suggest the importance of studying mutations that abnormally strengthen protein interactions in cancer.

Second, aided by our SNP-IN tool, we perform a systematic edgetic profiling of population-specific non-synonymous SNVs (nsSNVs) and interrogate their role in the human interactome. Our results demonstrate that a considerable number of normal nsSNVs can have a disruptive impact on the interactome. We also show that genes enriched with disruptive mutations are associated with diverse functions and are implicated in various diseases. Further analysis indicates that distinct gene edgetic profiles among major populations can help explain population phenotypic variance. Finally, network analysis reveals that phenotype-associated modules are enriched with disruptive mutations, and that differences in the accumulated damage in such modules may suggest population-specific disease susceptibility.
Lastly, we propose and develop a computational framework, Discovering most IMpacted SUbnetworks in interactoMe (DIMSUM), which integrates genome-wide association studies (GWAS) and the functional effects of mutations into the protein–protein interaction (PPI) network to improve disease module detection. Specifically, our approach incorporates and propagates the functional impact of non-synonymous single nucleotide polymorphisms (nsSNPs) on PPIs to implicate the genes that are most likely influenced by the disruptive mutations, and to identify the module with the greatest functional impact. Comparison against state-of-the-art seed-based module detection methods shows that our approach can yield modules that are biologically more relevant and more strongly associated with the studied disease.

With the advancement of next-generation sequencing technology that drives precision medicine, there is an increasing demand for understanding the changes in molecular mechanisms caused by specific genetic variation. Current and future in-silico edgotyping tools offer a cheap and fast way to handle the rapidly growing datasets of discovered mutations. Our work shows the feasibility of a large-scale in-silico edgetic study and reveals insights into the orchestrated play of mutations inside a complex PPI network. We also expect our module detection method to become part of the common toolbox for disease module analysis, facilitating the discovery of new disease markers.

Acknowledgements

I would like to extend my sincere gratitude to my dissertation advisor, Professor Dmitry Korkin. He is a respectful and resourceful scholar. He has also exemplified to me what it takes to do academic research: hard work, open-mindedness, scientific attitude, and perseverance. These have become invaluable assets in my life.
I owe my gratitude to my dissertation committee members, Professor Zheyang Wu, Professor Amity Manning, and Professor Manoj Bhasin, for their help, patience, and encouragement, which helped me keep improving this dissertation. Special thanks go to my fellow Korkin lab members over the years, including but not limited to: Nan Zhao, Andi Dhroso, Nathan Johnson, Katie Hughes, Oleksandr Narykov, Pavel Terentiev, Suhas Srinivasan, Ana Leshchyk, Nan Hu, Ziyang Gao, and Senbao Lu. You have all been an invaluable resource and good friends. Finally, I would like to thank my parents and my big brother for their unconditional love and support, and for all the precious values they taught me that have become part of my life.

Contents

Chapter 1 Introduction
1.1 Motivation
1.2 Research Objectives
1.2.1 Systematically characterize mutations associated with genetic diseases in the interactome and explore their functional role
1.2.2 Perform an edgotype based analysis of population specific SNV in human interactome
1.2.3 Develop a computational framework incorporating the functional impact of the mutations for disease module detection
1.3 Dissertation Organization
Chapter 2 Background and Related Work
2.1 Human Genetic Variants
2.1.1 Single Nucleotide Variants (SNVs)
2.1.2 Genetic Variation Detection Techniques
2.1.3 Databases on Genetic Variation
2.1.4 Functional Annotation of Genetic Variations
2.2 Complex Genetic Diseases