Computational Modeling of the Relationship Between Snps and Disease

ABSTRACT Title of Document: COMPUTATIONAL MODELING OF THE RELATIONSHIP BETWEEN SNPS AND DISEASE Peng Yue, Doctor of Philosophy, 2005 Directed By: Professor, John Moult, CARB/UMBI We have developed two models, the stability model and the profile model, to identify non-synonymous single base changes (the most common cause of monogenic disease) that have deleterious effects on protein function in vivo. The stability model analyzes the effect of the resulting amino acid change on protein stability by utilizing structural information such as reduction in hydrophobic area and loss of electrostatic interactions. The profile model makes use of the conservation and type of residues observed at a base change position within a protein family. In each model, a machine learning technique, the support vector machine (SVM) was trained on a set of mutations causative of disease, and a control set of non-disease causing mutations. In jack-knifed testing, the stability model identifies 74% of disease mutations, with a false positive rate of 15%; the profile model identifies 80% of disease mutations, with a false positive rate of 10%. Evaluation of a set of in vitro mutagenesis data with the stability model established that the majority of disease mutations affect protein stability by 1 to 3 Kcal/mol. The stability model’s effective distinction between disease and non-disease variants strongly supports the hypothesis that loss of protein stability is a major factor contributing to monogenic disease. Both models are used to identify deleterious SNPs in the human population. After carefully controlling of errors, we find that approximately one-fourth of the known non-synonymous SNPs are deleterious, thus providing a set of possible SNPs contributing to human complex disease traits. A web resource has been developed to provide information on disease/gene relationships at the molecular level. The resource has three primary modules. The first module is used to publish the deleterious SNPs identified by the two above- mentioned models. The second module identifies the candidate genes for a specific disease, and the third module provides information about the relationships between the sets of candidate genes. Disease/candidate gene relationships and gene-gene relationships are derived from the literature using a simple but effective text profiling method. COMPUTATIONAL MODELING OF THE RELATIONSHIP BETWEEN SNPS AND DISEASE By Peng Yue Dissertation submitted to the Faculty of the Graduate School of the University of Maryland, College Park, in partial fulfillment of the requirements for the degree of Doctor of Philosophy 2005 Advisory Committee: Professor John Moult, Chair Professor Michael Gilson Professor Richard Payne Associate Professor Stephen Mount Associate Professor Sarah Tishkoff Associate Professor Victor Muñoz, Dean’s representative © Copyright by Peng Yue 2005 Acknowledgements I want to thank Dr. John Moult, my mentor and supervisor, from whom I have learned so much over the years. Without his great insight, knowledge, patience and kindness, this dissertation could not have been written. I am also grateful to all my other committee members, Dr. Michael Gilson, Dr. Richard Payne, Dr. Sarah Tishkoff, and Dr. Stephen Mount. Their helpful advice and guidance have certainly played an essential role in improvement of this work. I would also like to thank Dr. Victor Muñoz for being Dean’s representative. I am thankful to Ms. Zhaolong Li, for her contribution in the stability model. My special thanks go to Mr. Eugene Melamud, with whom I have worked closely in the past six years and developed such deep friendship. His unselfish help have contributed tremendously to the success of this project. Last but not least, my deepest appreciation goes to my family, especially to my wife and my mother. Without their loving support, this long journey toward completion of my study and dissertation is doomed to be a lonely and more arduous one. ii Table of Contents Acknowledgements....................................................................................................... ii Table of Contents.........................................................................................................iii List of Tables ............................................................................................................... vi List of Figures............................................................................................................. vii Chapter 1: Introduction................................................................................................. 1 SNP in Populations and Human Disease .................................................................. 1 Two Types of Human Disease.................................................................................. 2 The Allelic Structure of Human Disease .................................................................. 3 Genetic Mapping of Human Disease ........................................................................ 4 Linkage Analysis .................................................................................................. 4 Association Studies............................................................................................... 4 Bioinformatics........................................................................................................... 6 Computational Modeling of SNPs: Overview ...................................................... 7 Computational Modeling of SNPs: Hypotheses ................................................. 10 SNPs3D: a resource for analysis of SNP, identification of candidate genes and construction of gene relationship networks ........................................................ 11 Technology overview.............................................................................................. 13 Classification....................................................................................................... 13 Comparative modeling........................................................................................ 14 Chapter 2: Loss of Protein Structure Stability as a Major Causative Factor in Monogenic Disease..................................................................................................... 18 Introduction............................................................................................................. 18 Results..................................................................................................................... 20 Selection of Data for Analysis ............................................................................ 20 Analysis of Factors Likely to Affect Protein Stability........................................ 20 Accuracy of the SVM Model.............................................................................. 26 Model Evaluation using in vitro Mutagenesis Stability Data ............................. 28 Alternative Test Sets........................................................................................... 32 Functional Analysis of Single Residue Mutations.............................................. 35 Investigation of the Role of Protein Structure Accuracy.................................... 37 Discussion and Conclusion..................................................................................... 39 Role of Protein Destabilization in Monogenic Disease ...................................... 39 Why does Protein Stability play a Prominent Role in Monogenic Disease?...... 41 Distinguishing Properties of Monogenic Disease Proteins................................. 42 Advantages and Disadvantages of a Protein Structure Based Approach............ 43 Methods................................................................................................................... 44 Identification of Single Residue Variants related to Monogenic Disease .......... 44 Identification of a Set of Single Residue Variants not related to Disease .......... 44 Selection of Sets of Mutants with Protein Structure........................................... 45 Support Vector Machine..................................................................................... 45 SVM Model Training and Testing...................................................................... 46 In-vitro Mutagenesis Data................................................................................... 46 iii Comparative Modeling of Protein Structure....................................................... 47 Modeling the Structure of Single Residue Mutants............................................ 48 Modeling the Stability Impact of a Single Residue Mutant................................ 48 Evaluation of Discrimination power of each Stability Factor ............................ 52 Identification of Residues with a Role in Molecular Function........................... 53 Chapter 3: Identification and Analysis of Deleterious Human SNPs......................... 54 Introduction............................................................................................................. 54 Results..................................................................................................................... 57 Training and Testing Data used for the Classification Methods......................... 57 Accuracy of the Classification Methods............................................................. 57 Sensitivity to the Number of Sequences in a Profile .........................................

Computational Modeling of the Relationship Between Snps and Disease

Genotype/Phenotype Analysis in a Male Patient with Partial Trisomy 4P and Monosomy 20Q Due to Maternal Reciprocal Translocation (4;20): a Case Report

The DNA Sequence and Comparative Analysis of Human Chromosome 20

Mitochondrial Sequestration of GABA by Aralar in Drosophila Cyfip Haploinsufficiency Causes Deficits in Social Group Behavior

Dissecting the Genetics of Human Communication

Pankreas-Forschungslabor Chirurgische Klinik Und Poliklinik Klinikum Rechts Der Isar

ACO2 Clinicobiological Dataset with Extensive Phenotype Ontology Annotation

UC Davis UC Davis Previously Published Works

The Role of Tks5 Sh3 Domains in Invadopodia Development and Activity

Genetic Analysis on Familial Nonsyndromic Hearing Loss Using Next-Generation Sequencing

A Multi-Factorial Analysis of Response to Warfarin in a UK Prospective Cohort Stephane Bourgeois1, Andrea Jorgensen2, Eunice J