Computational Prediction of Protein-Protein Interactions On

COMPUTATIONAL PREDICTION OF PROTEIN-PROTEIN INTERACTIONS ON THE PROTEOMIC SCALE USING BAYESIAN ENSEMBLE OF MULTIPLE FEATURE DATABASES

A Dissertation
Presented to
The Graduate Faculty of The University of Akron

In Partial Fulfillment
of the Requirements for the Degree
Doctor of Philosophy

Vivek Kumar
December, 2011

COMPUTATIONAL PREDICTION OF PROTEIN-PROTEIN INTERACTIONS ON THE PROTEOMIC SCALE USING BAYESIAN ENSEMBLE OF MULTIPLE FEATURE DATABASES

Vivek Kumar

Dissertation

Approved:
Advisor: Dr. Dale H. Mugler
Committee Member: Dr. Daniel B. Sheffer
Committee Member: Dr. George C. Giakos
Committee Member: Dr. Amy Milsted
Committee Member: Dr. Daniel L. Ely

Accepted:
Department Chair: Dr. Daniel B. Sheffer
Dean of the College: Dr. George K. Haritos
Dean of the Graduate School: Dr. George R. Newkome
Date: ______________________________

ABSTRACT

In the post-genomic world, one of the most important and challenging problems is to understand protein-protein interactions (PPIs) on a large scale. They are integral to the underlying mechanisms of most of the fundamental cellular processes. A number of experimental methods, such as protein affinity chromatography, affinity blotting, and immunoprecipitation, have traditionally helped in detecting PPIs on a small scale. Recently, high-throughput methods have made available an increasing amount of PPI data. However, these data contain a significant amount of erroneous information in the form of false positives and false negatives, and they show little overlap among PPIs pooled from different methods, which severely limits their reliability. Because of such limitations, computational predictions are emerging to narrow down the set of putative PPIs.
In this dissertation, a novel computational PPI predictor was devised to predict PPIs with high accuracy. The PPI predictor integrates a number of proteomic features derived from biological databases. The features chosen for the purpose of this research were gene expression, gene ontology, MIPS functions, sequence patterns such as motifs and domains, and protein essentiality. While these features have little or no correlation with each other, they share some degree of relationship with the ability of proteins to interact with each other. Therefore, novel feature-specific approaches were devised to characterize that relationship. Text mining and network-topology-based approaches were also studied. Gold Standard data comprising high-confidence PPIs and non-PPIs were used as evidence of interaction or lack thereof. The predictive power of the individual features was integrated using Bayesian methods. The average accuracy, based on 10-fold cross-validation, was found to be 0.9396. Since all the features are computed on the proteomic scale, the Bayesian integration yields likelihood values for all possible combinations of proteins in the proteome. This has the added benefit of making it possible to list putative PPIs in decreasing order of confidence, as measured by their likelihood values. Integration of novel PPIs with other relevant biological information using Semantic Web representation was examined to better understand the underlying mechanisms of diseases and to identify novel targets for drug discovery.

ACKNOWLEDGEMENTS

I am deeply indebted to many people who have contributed to the completion of my research and graduate studies at The University of Akron. Without their support, patience and guidance, this dissertation work simply would not have been possible. It is to them I owe my deepest gratitude.

First and foremost, I would like to thank my advisor, Dr. Dale H. Mugler, for providing me with the opportunity to conduct this interdisciplinary research work under his supervision. I sincerely appreciate his invaluable guidance and unwavering support throughout my graduate studies.

I would like to express my deepest gratitude to the department chair, Dr. Daniel B. Sheffer, for helping me understand the intricacies of statistics, which was vital in laying the foundation for multivariate statistics and machine learning, the tools that will find their use well beyond my doctoral research. I am also grateful to Dr. Amy Milsted for teaching the advanced concepts in molecular biology and to Dr. Richard Laundraville for providing the opportunity to conduct experiments with a variety of wet lab molecular biology techniques. This experience proved integral to my research work in proteomics. I owe special thanks to my committee members Dr. George C. Giakos and Dr. Daniel L. Ely for their timely guidance and feedback regarding the contents and format of this dissertation.

I am particularly indebted to the president of the university, Dr. Luis M. Proenza, for recognizing my research efforts by providing financial assistance for three years, in addition to the graduate scholarship offered by the department.

Lastly, I would like to thank my family and friends for their unending love, encouragement and support at all stages of my doctoral studies. I will always be indebted to my parents, who raised me with a love for science and nature and encouraged me to live a life of inquiry.

TABLE OF CONTENTS

LIST OF FIGURES

CHAPTER

I. INTRODUCTION
  1.1 Protein-Protein Interactions: An Introduction
  1.2 Motivation
  1.3 Objectives
  1.4 Organization of the Dissertation

II. LITERATURE REVIEW
  2.1 Yeast as Model Organism for Research in Molecular Biology
  2.2 Types of Protein-Protein Interactions
  2.3 High-throughput Methods for Detecting PPIs
    2.3.1 Kinetics of Protein-Protein Interaction Assays
    2.3.2 Yeast Two-Hybrid (Y2H) System
    2.3.3 Variations of Yeast Two-Hybrid (Y2H) System
    2.3.4 Protein Fragment Complementation Assays
    2.3.5 Co-immunoprecipitation
    2.3.6 Protein Microarrays
  2.4 PPI Databases
    2.4.1 Database of Interacting Proteins (DIP)
    2.4.2 Biological General Repository for Interaction Datasets (BioGRID)
    2.4.3 Biomolecular Interaction Network Database (BIND)
    2.4.4 IntAct
    2.4.5 Molecular INTeraction (MINT)
  2.5 Computational Approaches to Predict PPIs
  2.6 Protein-Protein Interaction Topology and Prediction
  2.7 Genomic Sequences and Protein-Protein Interactions
  2.8 Motifs, Domains and Protein-Protein Interactions
  2.9 Gene Ontology and Protein-Protein Interactions
    2.9.1 GO Topology Based Semantic Similarity
    2.9.2 Information Theory Based Semantic Similarity
    2.9.3 Hybrid Approach Based Semantic Similarity
  2.10 Gene Expression and Protein-Protein Interactions
  2.11 Protein Essentiality and Protein-Protein Interactions
  2.12 Text Mining and Protein-Protein Interactions
  2.13 Protein-Protein Interaction Prediction using Integrative Approaches

III. MATERIALS AND METHODS
  3.1 Research Hypothesis
  3.2 ORFs – Interchangeability with Genes and Proteins
  3.3 Proposed PPI Prediction Techniques
    3.3.1
