Computational Prediction of Compound–Protein Interactions for Orphan Targets Using CGBVS
Total Page:16
File Type:pdf, Size:1020Kb
molecules Article Computational Prediction of Compound–Protein Interactions for Orphan Targets Using CGBVS Chisato Kanai 1,* , Enzo Kawasaki 1,* , Ryuta Murakami 1, Yusuke Morita 2 and Atsushi Yoshimori 3 1 Data Science Division, INTAGE Healthcare Inc., 2F NREG Midosuji Bldg., 3-5-7 Kawara-Machi, Chuo-ku, Osaka 541-0048, Japan; [email protected] 2 Business Development Division, Advanced Technology Department, INTAGE Inc., Akihabara Building, 3 Kanda-Neribeicho, Chiyoda-ku, Tokyo 101-8201, Japan; [email protected] 3 Institute for Theoretical Medicine Inc., 26-1 Muraoka-Higashi 2-Chome, Fujisawa 251-0012, Japan; [email protected] * Correspondence: [email protected] (C.K.); [email protected] (E.K.) Abstract: A variety of Artificial Intelligence (AI)-based (Machine Learning) techniques have been developed with regard to in silico prediction of Compound–Protein interactions (CPI)—one of which is a technique we refer to as chemical genomics-based virtual screening (CGBVS). Prediction calculations done via pairwise kernel-based support vector machine (SVM) is the main feature of CGBVS which gives high prediction accuracy, with simple implementation and easy handling. We studied whether the CGBVS technique can identify ligands for targets without ligand information (orphan targets) using data from G protein-coupled receptor (GPCR) families. As the validation method, we tested whether the ligand prediction was correct for a virtual orphan GPCR in which all ligand information for one selected target was omitted from the training data. We have specifically Citation: Kanai, C.; Kawasaki, E.; expressed the results of this study as applicability index and developed a method to determine Murakami, R.; Morita, Y.; whether CGBVS can be used to predict GPCR ligands. Validation results showed that the prediction Yoshimori, A. Computational Prediction of Compound–Protein accuracy of each GPCR differed greatly, but models using Multiple Sequence Alignment (MSA) as Interactions for Orphan Targets Using the protein descriptor performed well in terms of overall prediction accuracy. We also discovered CGBVS. Molecules 2021, 26, 5131. that the effect of the type compound descriptors on the prediction accuracy was less significant than https://doi.org/10.3390/ that of the type of protein descriptors used. Furthermore, we found that the accuracy of the ligand molecules26175131 prediction depends on the amount of ligand information with regard to GPCRs related to the target. Additionally, the prediction accuracy tends to be high if a large amount of ligand information for Academic Editor: Humbert related proteins is used in the training. González-Díaz Keywords: orphan GPCR; virtual orphan GPCR; enrichment factor (EF); area under receiver operat- Received: 20 July 2021 ing characteristics (AUROC) Accepted: 23 August 2021 Published: 24 August 2021 Publisher’s Note: MDPI stays neutral 1. Introduction with regard to jurisdictional claims in published maps and institutional affil- Post-genome research has been providing a large amount of omics data on genes and iations. proteins, including genomes, transcriptomes, and proteomes. On the other hand, the devel- opment of technologies such as high-throughput screening has led to the accumulation of compound and bioactivity information on a vast number of compounds and drugs. This information is published in public databases such as ChEMBL [1–3] and PubChem [4] and are freely available to use. Such bioactivity information between compounds and Copyright: © 2021 by the authors. Licensee MDPI, Basel, Switzerland. proteins is also referred to as drug–target interaction (DTI) and in a broad context, simply This article is an open access article Compound–Protein interaction (CPI). The research to utilize such data has been been distributed under the terms and one of the major hot topics in the field of drug discovery. Many drugs affect the human conditions of the Creative Commons body in the form of drug effects or side effects through interactions with biomolecules Attribution (CC BY) license (https:// such as target proteins. This is why the identification of CPIs is an important issue in creativecommons.org/licenses/by/ drug discovery research. However, accurate and comprehensive identification of CPIs in 4.0/). experiments is almost impossible due to the enormous costs involved. In recent years, Molecules 2021, 26, 5131. https://doi.org/10.3390/molecules26175131 https://www.mdpi.com/journal/molecules Molecules 2021, 26, 5131 2 of 14 various Artificial Intelligence (AI) technologies have been developed to predict CPIs (or DTIs) on a large scale by effectively utilizing the vast amount of bioactivity data that has been accumulated [5–14]. In the early stages of drug discovery, ligands that act on target proteins are often insufficiently identified or not even identified at all. In addition, we cannot expect to get a lot of information on the three-dimensional structure of target proteins in these cases. The in silico approach does not work well in situations where known active ligands and protein structural information is extremely limited. Nevertheless, in order to move forward with the drug discovery project, we should also consider trying in silico approaches when we want a ligand even if it is less active. In such cases, AI technology for CPI prediction is promising as an in silico approach. One of the many AI-based methods for predicting CPIs is called the CGBVS technique. This technique is theoretically simpler than other methods, can be implemented without difficulty, and gives sufficient accuracy of prediction. Hamanaka et al. [14] have imple- mented CGBVS with a deep neural network (CGBVS-DNN) that enabled training of over a million CPIs. Wassermann et al. [7], using a machine learning approach similar to CGBVS, used a limited number of protease targets as an example to test the prediction accuracy for orphan targets. They have concluded that ligand information of nearest neighbors is essential for a good prediction of ligands of orphan targets. We studied the following two aspects when using the CGBVS technique. The first aspect is how accurate the ligand prediction is for orphan targets. In this study, we focused on the G protein-coupled receptor (GPCR) family, which is an important and data-rich target in drug discovery. Out of the available 243 possible targets, we randomly selected 52 GPCRs. We created 52 machine learning models of CGBVS, omitting all the ligand information for one particular target GPCR per model. That is, we created a virtual orphan GPCR per model and tested whether the model could predict the ligands for that virtual orphan GPCR. We also investigated how the accuracy of ligand prediction for 52 selected virtual orphan GPCRs is affected by the combination of compound and protein descriptors used in the machine learning process. The second aspect is to examine the conditions and applicability of high prediction accuracy. Here, we first introduced an applicability index which helped us determine whether it is possible to apply CGBVS to true orphan GPCRs. 2. Materials and Methods 2.1. CGBVS For the purpose of investigating the relationship between the applicability of the CGBVS method to ligand prediction of orphan GPCRs and the protein kernel, we used SVM instead of Deep Neural Network. The CGBVS technique we used is mostly implemented according to the method studied by Yabuuchi et al. [8], but the machine learning method for Support Vector Machine (SVM) [15] is slightly different from the original CGBVS technique. The reason is that the kernel function part of SVM is clearly divided into a compound- derived part and a protein-derived part for each calculation. The method of calculating this SVM is the same as the method used in the work of Wassermann et al. [7]. Letting c be the compound vector and p be the protein vector, we can then express the Compound–Protein interaction vector (CPI vector) x as their tensor product x = c ⊗ p. The SVM kernel for the CPI vector can then be expressed by [16]: 0 0 0 K(x, x ) = KC(c, c ) · KP(p, p ), (1) where KC and KP are the compound and protein kernels, respectively. Since the compound and protein kernels can be calculated independently, there is no need to explicitly calculate the matrix representation of the tensor product as a representation of the actual CPI vector. A schematic diagram of the calculation procedure of our CGBVS technique is shown in Figure1. The first step is to prepare the structural formula data set of the compounds and the amino acid sequence data set of the proteins for machine learning. From a compound Molecules 2021, 26, 5131 3 of 14 structural formula, descriptors such as physicochemical parameters and fingerprints are calculated and converted into a compound vector. From an amino acid sequence, the descriptors associated with the strings are calculated and converted into a protein vector. The two vectors created are combined according to bioactivity data to create the CPI vector. If the activity value of the data is higher than the set threshold, it is a positive CPI vector; otherwise, it is a negative CPI vector. The CGBVS model is created by machine learning via SVM of positive and negative CPI vectors based on aforementioned kernels. This CGBVS model allows us to predict the activity of unknown Compound–Protein combinations. The usual SVM score is the value of the distance from the discriminative surface (decision function), but this value is sigmoidally transformed to perform probability estimation [17]. Compound