Computational Modeling of the Relationship Between Snps and Disease
Total Page:16
File Type:pdf, Size:1020Kb
ABSTRACT Title of Document: COMPUTATIONAL MODELING OF THE RELATIONSHIP BETWEEN SNPS AND DISEASE Peng Yue, Doctor of Philosophy, 2005 Directed By: Professor, John Moult, CARB/UMBI We have developed two models, the stability model and the profile model, to identify non-synonymous single base changes (the most common cause of monogenic disease) that have deleterious effects on protein function in vivo. The stability model analyzes the effect of the resulting amino acid change on protein stability by utilizing structural information such as reduction in hydrophobic area and loss of electrostatic interactions. The profile model makes use of the conservation and type of residues observed at a base change position within a protein family. In each model, a machine learning technique, the support vector machine (SVM) was trained on a set of mutations causative of disease, and a control set of non-disease causing mutations. In jack-knifed testing, the stability model identifies 74% of disease mutations, with a false positive rate of 15%; the profile model identifies 80% of disease mutations, with a false positive rate of 10%. Evaluation of a set of in vitro mutagenesis data with the stability model established that the majority of disease mutations affect protein stability by 1 to 3 Kcal/mol. The stability model’s effective distinction between disease and non-disease variants strongly supports the hypothesis that loss of protein stability is a major factor contributing to monogenic disease. Both models are used to identify deleterious SNPs in the human population. After carefully controlling of errors, we find that approximately one-fourth of the known non-synonymous SNPs are deleterious, thus providing a set of possible SNPs contributing to human complex disease traits. A web resource has been developed to provide information on disease/gene relationships at the molecular level. The resource has three primary modules. The first module is used to publish the deleterious SNPs identified by the two above- mentioned models. The second module identifies the candidate genes for a specific disease, and the third module provides information about the relationships between the sets of candidate genes. Disease/candidate gene relationships and gene-gene relationships are derived from the literature using a simple but effective text profiling method. COMPUTATIONAL MODELING OF THE RELATIONSHIP BETWEEN SNPS AND DISEASE By Peng Yue Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2005 Advisory Committee: Professor John Moult, Chair Professor Michael Gilson Professor Richard Payne Associate Professor Stephen Mount Associate Professor Sarah Tishkoff Associate Professor Victor Muñoz, Dean’s representative © Copyright by Peng Yue 2005 Acknowledgements I want to thank Dr. John Moult, my mentor and supervisor, from whom I have learned so much over the years. Without his great insight, knowledge, patience and kindness, this dissertation could not have been written. I am also grateful to all my other committee members, Dr. Michael Gilson, Dr. Richard Payne, Dr. Sarah Tishkoff, and Dr. Stephen Mount. Their helpful advice and guidance have certainly played an essential role in improvement of this work. I would also like to thank Dr. Victor Muñoz for being Dean’s representative. I am thankful to Ms. Zhaolong Li, for her contribution in the stability model. My special thanks go to Mr. Eugene Melamud, with whom I have worked closely in the past six years and developed such deep friendship. His unselfish help have contributed tremendously to the success of this project. Last but not least, my deepest appreciation goes to my family, especially to my wife and my mother. Without their loving support, this long journey toward completion of my study and dissertation is doomed to be a lonely and more arduous one. ii Table of Contents Acknowledgements....................................................................................................... ii Table of Contents.........................................................................................................iii List of Tables ............................................................................................................... vi List of Figures............................................................................................................. vii Chapter 1: Introduction................................................................................................. 1 SNP in Populations and Human Disease .................................................................. 1 Two Types of Human Disease.................................................................................. 2 The Allelic Structure of Human Disease .................................................................. 3 Genetic Mapping of Human Disease ........................................................................ 4 Linkage Analysis .................................................................................................. 4 Association Studies............................................................................................... 4 Bioinformatics........................................................................................................... 6 Computational Modeling of SNPs: Overview ...................................................... 7 Computational Modeling of SNPs: Hypotheses ................................................. 10 SNPs3D: a resource for analysis of SNP, identification of candidate genes and construction of gene relationship networks ........................................................ 11 Technology overview.............................................................................................. 13 Classification....................................................................................................... 13 Comparative modeling........................................................................................ 14 Chapter 2: Loss of Protein Structure Stability as a Major Causative Factor in Monogenic Disease..................................................................................................... 18 Introduction............................................................................................................. 18 Results..................................................................................................................... 20 Selection of Data for Analysis ............................................................................ 20 Analysis of Factors Likely to Affect Protein Stability........................................ 20 Accuracy of the SVM Model.............................................................................. 26 Model Evaluation using in vitro Mutagenesis Stability Data ............................. 28 Alternative Test Sets........................................................................................... 32 Functional Analysis of Single Residue Mutations.............................................. 35 Investigation of the Role of Protein Structure Accuracy.................................... 37 Discussion and Conclusion..................................................................................... 39 Role of Protein Destabilization in Monogenic Disease ...................................... 39 Why does Protein Stability play a Prominent Role in Monogenic Disease?...... 41 Distinguishing Properties of Monogenic Disease Proteins................................. 42 Advantages and Disadvantages of a Protein Structure Based Approach............ 43 Methods................................................................................................................... 44 Identification of Single Residue Variants related to Monogenic Disease .......... 44 Identification of a Set of Single Residue Variants not related to Disease .......... 44 Selection of Sets of Mutants with Protein Structure........................................... 45 Support Vector Machine..................................................................................... 45 SVM Model Training and Testing...................................................................... 46 In-vitro Mutagenesis Data................................................................................... 46 iii Comparative Modeling of Protein Structure....................................................... 47 Modeling the Structure of Single Residue Mutants............................................ 48 Modeling the Stability Impact of a Single Residue Mutant................................ 48 Evaluation of Discrimination power of each Stability Factor ............................ 52 Identification of Residues with a Role in Molecular Function........................... 53 Chapter 3: Identification and Analysis of Deleterious Human SNPs......................... 54 Introduction............................................................................................................. 54 Results..................................................................................................................... 57 Training and Testing Data used for the Classification Methods......................... 57 Accuracy of the Classification Methods............................................................. 57 Sensitivity to the Number of Sequences in a Profile .........................................