Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration

by Max Kotlyar

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Medical Biophysics
University of Toronto

Copyright © 2011 by Max Kotlyar

Abstract

The currently known network of human protein-protein interactions (PPIs) is providing new insights into diseases and helping to identify potential therapies. However, according to several estimates, the known interaction network may represent only 10% of the entire interactome – indicating that more comprehensive knowledge of the interactome could have a major impact on understanding and treating diseases. The primary aim of this thesis was to develop computational methods to provide increased coverage of the interactome. A secondary aim was to gain a better understanding of the link between networks and phenotype, by analyzing essential mouse genes.

Two algorithms were developed to predict PPIs and provide increased coverage of the interactome: FpClass and mixed co-expression. FpClass differs from previous PPI prediction methods in two key ways: it integrates both positive and negative evidence for protein interactions, and it identifies synergies between predictive features. Through these approaches FpClass provides interaction networks with significantly improved reliability and interactome coverage. Compared to previous predicted human PPI networks, FpClass provides a network with over 10 times more interactions, about 2 times more proteins and a lower false discovery rate. This network includes 595 disease-related proteins from OMIM and Cancer Gene Census which have no previously known interactions.
The second method, mixed co-expression, aims to predict transient PPIs, which have proven difficult to detect by computational and experimental methods. Mixed co-expression makes predictions using gene co-expression and performs significantly better (p < 0.05) than the previous method for predicting PPIs from co-expression. It is especially effective for identifying interactions of transferases and signal transduction proteins.

For the second aim of the thesis, we investigated the relationship between gene essentiality and diverse gene/protein features based on gene expression, PPI and gene co-expression networks, gene/protein sequence, Gene Ontology, and orthology. We identified non-redundant features closely associated with essentiality, including centrality in PPI and gene co-expression networks. We found that no single predictive feature was effective for all essential genes; most features, including centrality, were less effective for genes associated with postnatal lethality and infertility. These results suggest that understanding phenotype will require integrating measures of network topology with information about the biology of the network's nodes and edges.

Acknowledgements

Many people helped me complete this thesis and made the PhD experience much more enjoyable. First, I would like to thank my supervisor, Dr. Igor Jurisica, for providing a huge amount of help with all aspects of the thesis and great encouragement throughout the whole process. I would also like to thank the Jurisica Lab, past and present.
Special thanks to: Kristen, for amazing editing help and great collaborations; Daniela, for solidarity throughout the PhD process and very helpful thesis edits; Abraham, for excellent editing and ideas; Levi, for helping me adopt R and introducing the concept of apples with peanut butter; Dan, for urgent help with editing and biology questions; Kevin, for lots of help with the cluster, I2D, and biology questions; Yun, for text mining and very helpful ideas; Richard, for getting me through many computer crashes; Christian, for keeping the cluster going despite my repeated attempts to crash it; Ali, Dave, and Wing, for lots of Navigator help; Adrian, Attila, and Frederic, for help with I2D; Elize, for great solidarity and support; Anthony, for JavaScript help; Amira, Chiara, Fiona, and Periklis, for lots of help with slides and biology; Sara and Serene, for encouragement and reviewing ideas; Mahima and Dene, for great discussions and debates. And of course family, for incredible support – no thesis without them.

Contents

1 Introduction
  1.1 Mapping the human interactome
  1.2 Types of PPIs
  1.3 Experimental PPI Identification Methods
    1.3.1 Small-scale screens
    1.3.2 HTP screens
  1.4 Computational PPI Identification Methods
    1.4.1 Methods using genomic analysis
    1.4.2 Methods using protein primary structure
    1.4.3 Methods using protein domains
    1.4.4 Methods using protein tertiary structure
    1.4.5 Methods analyzing interaction networks
    1.4.6 Methods using data integration
  1.5 PPI networks and phenotype
  1.6 Summary of Research Contributions
    1.6.1 Chapter 2
    1.6.2 Chapter 3
    1.6.3 Chapter 4
2 Predicting human PPIs using non-independent features
  2.1 Introduction
  2.2 Results
    2.2.1 Complementarity of computational and HTP experimental methods
    2.2.2 False discovery rates of predictions
    2.2.3 Coverage of predicted networks
    2.2.4 Evidence for novel predicted PPIs
  2.3 Discussion
    2.3.1 Evaluating predictions
    2.3.2 Using predictive features in novel ways
    2.3.3 Predicting interactions of disease genes
  2.4 Methods
    2.4.1 Predicting interactions
    2.4.2 Calculating enrichment of PPI sets among predictions
  2.5 Supplemental Materials and Methods
    2.5.1 Links to data tables
    2.5.2 Datasets
    2.5.3 Prediction overview
    2.5.4 Features of individual proteins: description and sources
    2.5.5 Calculating interaction scores from features of individual proteins
    2.5.6 Features of protein pairs
    2.5.7 Score integration
3 Predicting Transient PPIs From Gene Co-expression
  3.1 Introduction
  3.2 Results
    3.2.1 Correlation distributions of stably and transiently interacting proteins have significant differences
    3.2.2 Transient interactions have significantly lower correlations than stable interactions
    3.2.3 Predictions of PPIs from gene co-expression are improved by considering local information
    3.2.4 Mixed approach improves recall of different interaction types
    3.2.5 Mixed and local approaches work best when expression datasets have genes with high variance
  3.3 Discussion
  3.4 Conclusions
  3.5 Methods
    3.5.1 Selection of gene expression datasets
    3.5.2 Processing of gene expression datasets
    3.5.3 Transient and stable interactions
    3.5.4 Transiently and stably interacting proteins
    3.5.5 Selection of protein pairs for test sets
    3.5.6 Testing for significant differences in correlation distribution tails
    3.5.7 Interaction prediction approaches
    3.5.8 Calculating significance of performance differences
    3.5.9 Selection of functional categories
    3.5.10 Software
  3.6 Supplemental Materials
    3.6.1 Links to data tables
4 Predicting Essential Mouse Genes
  4.1 Introduction
  4.2 Results
    4.2.1 Defining essential and non-essential mouse genes
    4.2.2 Identifying features related to essentiality
    4.2.3 Key features of essential genes
    4.2.4 Differences among essential genes
    4.2.5 Predicting essential genes
  4.3 Discussion
    4.3.1 Features from gene expression
    4.3.2 Features based on gene and protein sequence
    4.3.3 Networks and essentiality
  4.4 Conclusions
  4.5 Methods
    4.5.1 Gene phenotypes
    4.5.2 Gene expression
    4.5.3 Gene co-expression networks
    4.5.4 PPI networks
    4.5.5 Gene length
    4.5.6 Gene Ontology
    4.5.7 Ortholog information
    4.5.8 Protein structure
    4.5.9 Protein disorder
    4.5.10 Sequence biases
    4.5.11 Assessing relationships between predictive variables and essentiality
    4.5.12 Identifying non-redundant predictive features
    4.5.13 Identifying significant differences between different types of essential genes
    4.5.14 Predicting essential genes
  4.6 Supplemental Materials
    4.6.1 Links to data tables
5 Discussion
  5.1 FpClass: extending coverage of the human interactome
    5.1.1 Contributions and limitations
  5.2 Mixed co-expression: identifying transient interactions and interaction context
    5.2.1 Contributions and limitations
  5.3 Characterizing and predicting essential mouse genes
    5.3.1 Contributions and limitations
  5.4 Future improvements to introduced methods
  5.5 Future applications of methods and results
    5.5.1 Prediction of pathways and functional modules
    5.5.2 Prediction of protein function
    5.5.3 Prediction of disease genes
    5.5.4 Identification of module regulators
  5.6 Conclusions
Bibliography

Abbreviations

AUC  Area under the receiver operating characteristic (ROC) curve
CDS  Coding sequence
DIP  Database of Interacting Proteins
FDR  False discovery rate (= false positives / (false positives + true positives))
FPR  False positive rate
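
The false discovery rate listed among the abbreviations is the fraction of positive predictions that are false. As a minimal sketch of that arithmetic (the function name and the counts are hypothetical, chosen only for illustration, and are not taken from the thesis):

```python
def false_discovery_rate(false_positives: int, true_positives: int) -> float:
    """Return FDR = FP / (FP + TP), the share of predictions that are wrong."""
    predicted = false_positives + true_positives
    if predicted == 0:
        # With no positive predictions the ratio is undefined.
        raise ValueError("FDR is undefined when there are no predictions")
    return false_positives / predicted


# Hypothetical example: 100 predicted interactions, 20 of them incorrect.
print(false_discovery_rate(false_positives=20, true_positives=80))  # 0.2
```

Under these assumed counts, one in five predicted interactions would be a false positive.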
