Deep Learning Integration of Molecular and Interactome Data for Protein–Compound Interaction Prediction Narumi Watanabe, Yuuto Ohnuki and Yasubumi Sakakibara*
Total Page:16
File Type:pdf, Size:1020Kb
Watanabe et al. J Cheminform (2021) 13:36 https://doi.org/10.1186/s13321-021-00513-3 Journal of Cheminformatics RESEARCH ARTICLE Open Access Deep learning integration of molecular and interactome data for protein–compound interaction prediction Narumi Watanabe, Yuuto Ohnuki and Yasubumi Sakakibara* Abstract Motivation: Virtual screening, which can computationally predict the presence or absence of protein–compound interactions, has attracted attention as a large-scale, low-cost, and short-term search method for seed compounds. Existing machine learning methods for predicting protein–compound interactions are largely divided into those based on molecular structure data and those based on network data. The former utilize information on proteins and compounds, such as amino acid sequences and chemical structures; the latter rely on interaction network data, such as protein–protein interactions and compound–compound interactions. However, there have been few attempts to combine both types of data in molecular information and interaction networks. Results: We developed a deep learning-based method that integrates protein features, compound features, and multiple types of interactome data to predict protein–compound interactions. We designed three benchmark datasets with diferent difculties and applied them to evaluate the prediction method. The performance evaluations show that our deep learning framework for integrating molecular structure data and interactome data outperforms state-of-the-art machine learning methods for protein–compound interaction prediction tasks. The performance improvement is statistically signifcant according to the Wilcoxon signed-rank test. This fnding reveals that the multi- interactome data captures perspectives other than amino acid sequence homology and chemical structure similarity and that both types of data synergistically improve the prediction accuracy. Furthermore, experiments on the three benchmark datasets show that our method is more robust than existing methods in accurately predicting interac- tions between proteins and compounds that are unseen in training samples. Keywords: Protein–compound interaction, Deep learning, Heterogeneous interaction network, Integration Introduction via experiments is expensive in terms of time and cost. Most compounds that act as drugs bind to target pro- Virtual screening that can computationally classify the teins that can cause disease, and these compounds can presence or absence of protein–compound interactions control their functions. Terefore, it is necessary when has attracted attention as a large-scale, low-cost, short- developing new drugs to search for compounds that can term search method for seed compounds. In particular, interact with the target protein, and this process must be machine learning for virtual screening is considered to be performed efciently. However, determining the inter- applicable to a wide variety of proteins and compounds. action of a large number of protein–compound pairs Machine learning-based methods for predicting pro- tein–compound interactions are largely divided into those based on molecular structure data and those based *Correspondence: [email protected] Department of Biosciences and Informatics, Keio University, 3-14-1 on network data. Te former use protein and compound Hiyoshi, Kohoku-ku, Yokohama, Kanagawa 223-8522, Japan data represented in amino acid sequences and chemical © The Author(s) 2021. This article is licensed under a Creative Commons Attribution 4.0 International License, which permits use, sharing, adaptation, distribution and reproduction in any medium or format, as long as you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons licence, and indicate if changes were made. The images or other third party material in this article are included in the article’s Creative Commons licence, unless indicated otherwise in a credit line to the material. If material is not included in the article’s Creative Commons licence and your intended use is not permitted by statutory regulation or exceeds the permitted use, you will need to obtain permission directly from the copyright holder. To view a copy of this licence, visit http:// creat iveco mmons. org/ licen ses/ by/4. 0/. The Creative Commons Public Domain Dedication waiver (http:// creat iveco mmons. org/ publi cdoma in/ zero/1. 0/) applies to the data made available in this article, unless otherwise stated in a credit line to the data. Watanabe et al. J Cheminform (2021) 13:36 Page 2 of 12 structure formulas, and they can be applied to proteins experiential and computational methods have been when a docking simulation cannot be performed because developed to identify various protein–protein interac- the 3D structure is unknown. In our previous studies tions. High-throughput experimental methods, such [1–3], we performed binary classifcation for predicting as yeast two-hybrid screening, were developed to dis- protein–compound interactions using a support vector cover and validate protein–protein interactions on a machine (SVM) on an interaction dataset downloaded large scale. Computational methods for protein–protein from DrugBank (a database that contains information interaction predictions employ various machine learn- on existing drug compounds) [4]. A prediction accuracy ing methods, such as SVM with feature extraction engi- of 85.1% was achieved. Based on this result, we devel- neering [14]. Te recurrent convolutional neural network oped COPICAT, a comprehensive prediction system (CNN), which is a deep learning method, was applied to for protein–compound interactions, which enabled us sequence-based prediction for protein–protein interac- to search for lead compounds from a large compound tions [15]. Compounds that can interact with each other database, PubChem [5], consisting of tens of millions of are often represented as compound–compound interac- compounds. tions (also known as chemical–chemical interactions), Deep learning, a method developed in the feld of and interactive compounds tend to share similar func- machine learning, has been applied in a variety of felds in tions. Compound–compound interactions, called drug– recent years because it achieves high prediction accuracy drug interactions, can be used to predict side efects in felds such as image recognition, speech recognition, based on the assumption that interacting compounds and compound activity prediction [6]. Deep learning- are more likely to have similar toxicity [16]. A compu- based protein–compound interaction prediction meth- tational approach to compound–compound interaction ods have been developed based on molecular structure predictions has been studied with various machine learn- data [7–10]. However, as these existing deep learning- ing methods, including end-to-end learning with a CNN based methods utilize information based on only amino based on the SMILES representation [17]. acid sequences and chemical structures, the functional Te purpose of this study was to improve prediction properties of proteins and compounds have not yet been accuracy by integrating molecular structure data and incorporated into prediction models. heterogeneous interactome data into a deep learning Te other type of machine learning approach for pro- method for predicting protein–compound interactions. tein–compound interaction prediction is based on net- In addition to the molecular information (amino acid work data. An interaction network is commonly used to sequence and chemical structure) itself, protein–protein comprehensively represent interactions between mol- interaction network data with similar reaction pathways ecules. For example, the protein–protein interaction or physical direct binding and compound network data network represents the relationships among physically linking compounds with similar molecular activities are interacting proteins. In the protein–protein interac- incorporated into the deep learning model as multi-inter- tion network, a node is a protein, and an edge is drawn actome data. To the best of our knowledge, there are no between a pair of proteins that interact with each other. deep learning-based solutions for predicting protein– Some previous studies incorporated data from multiple compound interactions that integrate multiple hetero- interaction networks to predict molecular interactions. geneous interactome data along with the direct input of For instance, multi-modal graphs to handle three types of amino acid sequences and chemical structures. interactions have been proposed: protein–protein, pro- Tis study proposes a method for predicting protein– tein–drug, and polypharmacy side efects. A deep learn- compound (drug-target) interactions by combining pro- ing method for multi-modal graphs, Decagon [11], was tein features, compound features, and network context proposed to predict polypharmacy side efects. DTINet for both proteins and compounds. Te network context [12] and NeoDTI [13] were also designed and developed is in the form of protein–protein interactions from the as graph-based deep learning frameworks to integrate STRING database [18], and the compound–compound heterogeneous networks for drug–target interaction interactions are derived from the STITCH database (DTI) predictions and drug repositioning. In particular, [19]. Te protein–protein interaction network and com- NeoDTI exhibits substantial improvement in perfor- pound–compound interaction network are processed mance over other state-of-the-art