Predicting Disease-Related Genes by Path-Based Similarity and Community Structure in Protein-Protein Interaction Network

Manuscript Thursday, July 20, 2017 Predicting disease-related genes by path-based similarity and community structure in protein-protein interaction network Ke Hu1, Jing-Bo Hu1, Ju Xiang2,*, Hui-Jia Li3,4,*, Yan Zhang5,*, Shi Chen5, Chen-He Yi7 1Department of Physics, Xiangtan University Xiangtan, Xiangtan 411105, Hunan China 2Neuroscience Research Center & Department of Basic Medical Sciences, Changsha Medical University, Changsha 410219, Hunan, China 3School of Management Science and Engineering, Central University of Finance and Economics, Beijing 100080, China 4Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Beijing 100190, China 5Department of Computer, Changsha Medical University, Changsha 410219, Hunan, China 6School of Public Administration, Xiangtan University, Xiangtan 411105, Hunan, China * Corresponding authors: Ju Xiang or Hui-Jia Li or Yan Zhang. E-mail: [email protected] (J.X.); [email protected] (J.X.); [email protected] (H.J.L.); [email protected] (Y.Z.); [email protected] (K.H); [email protected] (S.C.) Abstract: Network-based computational approaches to predict unknown genes associated with certain diseases are of considerable significance for uncovering the molecular basis of human diseases. In this paper, we proposed a kind of new disease-gene-prediction methods by combining the path-based similarity with the community structure in the human protein-protein interaction network. Firstly, we introduced a set of path-based similarity indices, a novel community-based similarity index, and a new similarity combining the path-based similarity index. Then we assessed the statistical significance of the measures in distinguishing the disease genes from non-disease genes, to confirm their availability in predicting disease genes. Finally, we applied these measures to the disease-gene prediction of single disease-gene family, and analyzed the performance of these measures in disease-gene prediction, especially the effect of the community structure on the prediction performance in detail. The results indicated that genes associated with the same or similar diseases commonly reside in the same community of the protein-protein interaction network, and the community structure is greatly helpful for the disease-gene prediction. PACS: 89.75.–k; 89.75.Fb; 89.75.Hc Keywords: Complex networks; Community structure; Topological similarity; Protein-protein interaction networks; Disease genes 1 Manuscript Thursday, July 20, 2017 CONTENTS 1. Introduction .................................................................................................................................................... 2 2. Datasets ........................................................................................................................................................... 3 2.1. Human PPI Datasets ........................................................................................................................... 3 2.2. Disease-Gene Data ............................................................................................................................... 3 3. Methods ........................................................................................................................................................... 4 3.1. Definition of the topological similarity .............................................................................................. 4 3.1.1. Path-based Similarity (PS)....................................................................................................... 4 3.1.2. Community-based Similarity (CS) .......................................................................................... 5 3.1.3. Combined similarity based on path structure and community structure ........................... 6 3.2. Similarity scores of genes with respect to the disease genes ............................................................ 7 3.3. Metric ................................................................................................................................................... 7 4. Experimental results ...................................................................................................................................... 8 4.1. Analysis of feasibility ........................................................................................................................... 8 4.2. Performance of method..................................................................................................................... 10 4.2.1 ROC and AUC ......................................................................................................................... 10 4.2.2. Precision .................................................................................................................................. 10 4.2.3. Discussion ................................................................................................................................. 11 5. Conclusion ..................................................................................................................................................... 12 Acknowledgement ............................................................................................................................................ 13 References ......................................................................................................................................................... 13 1. Introduction Identification of the hereditary disease-genes from human genome is one of the most important tasks in bioinformatics research[1, 2]. The traditional methods such as the positional cloning via linkage analysis were applied to discover the disease-related genes[3, 4], but they encounter difficulties, such as the variable disease penetrance [5] and the large number of genes among large family datasets that need to be analyzed[6, 7]. The former is an inherent problem that directly limits the performance of these traditional methods, while the latter is a labor-intensive task, which will cost much man power and resources. Therefore, it is necessary to develop the efficient computational approaches to predict unknown genes associated with certain diseases before the experimental research[8, 9]. Up to now, many features and resources have been employed to predict disease-related genes, such as sequence features[10], gene functional annotations[11, 12], and protein-protein interaction (PPI) network[13, 14], etc. Among them, the coverage of the gene functional annotations is limited, though they play an important role in disease candidate gene prioritization. In the past decades, benefit from the rapid development of experimental technologies (e.g., yeast two-hybrid technology[15]), more and more PPI data are becoming available, which greatly promoted the study of the disease-gene discovery[16]. Various candidate gene prioritization methods have been proposed based on the PPI network analysis[12, 13, 17-20]. The theoretical basis of these network-based approaches is that genes associated with the same or similar disease phenotype, are not distributed randomly in the network, and they possess many common topological features, for example, they usually have high connectivity, reside in more central location of the PPI network and cluster together[21-23]. Based on these topological properties, many scoring systems that match automatically these 2 Manuscript Thursday, July 20, 2017 topological features have been developed[24-29], for example, Xu and Li developed a classifier to predict disease genes by combining five different topological features[29]. A long-held and partially proved assumption shared by biologists is that genes associated with the same or similar diseases phenotypes are likely to be functionally related, and thus they commonly reside in the same neighborhood of molecular networks. For example, method based on direct neighbors of disease genes was very simple, but not the most efficient in disease-gene prediction[30]; then, Hsu et al considered the indirect neighbors on the basis of the direct neighbors of disease genes[31]; while Zhu et al proposed a kind of method integrating the direct neighbors, the indirect neighbors and the shortest path length between candidate and known disease genes[32]. It is worth noting that module structure is one of important properties in the PPI networks[33], while, in the previous studies, little attention was focused on the module property of the disease genes. It was clear that, proteins play their functions in a modular fashion, and mutations of proteins in the same module may lead to similar disease phenotypes[34]. The modular feature in the PPI network can be properly characterized by the community structure in network theory[35, 36]. Therefore, the community structure should be a more direct and robust property to capture a functional modularity in the PPI network. Further, it is expected that genes associated with the same or similar disease phenotypes commonly reside in the same community of the PPI network, and the community structure might be greatly helpful for disease-gene prediction. In this paper, based on the PPI network, we firstly introduce a set of path-based similarity indices and a novel community-based similarity index, and then a new similarity combining the path-based similarity with community-based similarity is designed. Secondly, we assess the statistical significance of the measures in distinguishing the

Predicting Disease-Related Genes by Path-Based Similarity and Community Structure in Protein-Protein Interaction Network

Gene Prediction and Genome Annotation

Gene Structure Prediction

A Curated Benchmark of Enhancer-Gene Interactions for Evaluating Enhancer-Target Gene Prediction Methods

There Is a Lot of Research on Gene Prediction Methods

Gene Prediction Using Deep Learning

Prediction of Protein-Protein Interactions and Essential Genes Through Data Integration

"An Overview of Gene Identification: Approaches, Strategies, and Considerations"

A Benchmark Study of Ab Initio Gene Prediction Methods in Diverse Eukaryotic Organisms

Bioinformatics: a Practical Guide to the Analysis of Genes and Proteins, Second Edition Andreas D

Progress in Gene Prediction: Principles and Challenges Srabanti Maji and Deepak Garg*

PATTERNS of DIPEPTIDE USAGE for GENE PREDICTION a Thesis

Bioinformatics Is a New Discipline That Addresses the Need to Manage and Interpret the Data That in the Past Decade Was Massively Generated by Genomic Research