Boosting Docking-Based Virtual Screening with Deep Learning
Total Page:16
File Type:pdf, Size:1020Kb
Boosting Docking-based Virtual Screening with Deep Learning Janaina Cruz Pereira,∗,y Ernesto Raúl Caffarena,∗,y and Cicero Nogueira dos Santos∗,z yFiocruz, 4365 Avenida Brasil, Rio de Janeiro, RJ 21040 900, Brazil zIBM Watson, 1101 Kitchawan Rd, Yorktown Heights, NY 10598, USA E-mail: janaina.pereira@ioc.fiocruz.br; ernesto@fiocruz.br; [email protected] Abstract Introduction In this work, we propose a deep learning ap- Drug discovery process is a time-consuming and proach to improve docking-based virtual screen- expensive task. The development and even the ing. The introduced deep neural network, repositioning of already known compounds is a DeepVS, uses the output of a docking program difficult chore.1 The scenario gets worse if we and learns how to extract relevant features from regard the thousands or millions of molecules basic data such as atom and residues types ob- capable of being synthesized in each develop- tained from protein-ligand complexes. Our ap- ment stage.2,3 proach introduces the use of atom and amino In the past, experimental methods such as acid embeddings and implements an effective high-throughput screening (HTS) could help way of creating distributed vector representa- making this decision through the screening of tions of protein-ligand complexes by modeling large chemical libraries against a biological tar- the compound as a set of atom contexts that get. However, the high cost of the whole pro- is further processed by a convolutional layer. cess associated with a low success rate turns One of the main advantages of the proposed this method inaccessible to the academia.2–4 method is that it does not require feature engi- In order to overcome these difficulties, the neering. We evaluate DeepVS on the Directory use of low-cost computational alternatives is ex- of Useful Decoys (DUD), using the output of tensively encouraged, and it was adopted rou- two docking programs: AutodockVina1.1.2 and tinely as a way to aid in the development of new Dock6.6. Using a strict evaluation with leave- drugs.3–5 arXiv:1608.04844v2 [q-bio.QM] 21 Nov 2016 one-out cross-validation, DeepVS outperforms Computational virtual screening works basi- the docking programs in both AUC ROC and cally as a filter (or a prefilter) consisting of the enrichment factor. Moreover, using the out- virtual selection of molecules, based on a par- put of AutodockVina1.1.2, DeepVS achieves an ticular predefined criterion of potentially active AUC ROC of 0.81, which, to the best of our compounds against a determined pharmacolog- knowledge, is the best AUC reported so far for ical target.2,4,6 virtual screening using the 40 receptors from Two variants of this method can be adopted: DUD. ligand-based virtual screening and structure- based virtual screening. The first one deals with the similarity and the physicochemical analysis of active ligands to predict the activity of other compounds with similar characteristics. The 1 second is utilized when the three-dimensional Although this process can be effective to some structure of the target receptor was already elu- degree, the manual identification of characteris- cidated somehow (experimentally or computa- tics is a laborious and complex process and can tionally modeled). This approach is used to not be applied in large scale, resulting in the explore molecular interactions between possi- loss of relevant information, consequently lead- ble active ligands and residues of the binding ing to a set of features incapable of explaining site. Structure-based methods present a better the actual complexity of the problem.1,3,15,16 performance when compared to methods based On the other hand, recent work on Deep solely on the structure of the ligand focusing on Learning (DL), a family of ML approaches the identification of new compounds with ther- that minimizes feature engineering, has demon- apeutic potential.1,7–9 strated enormous success in different tasks from One of the computational methodologies ex- multiple fields.15–17 DL approaches normally tensively used to investigate these interactions learn features (representations) directly from is molecular docking.5,8,10 The selection of more the raw data with minimal or none human in- potent ligands using Docking-based Virtual tervention, which makes the resulting system Screening (DBVS) is made by performing the easier to adapt to new datasets. insertion of each compound from a compound In the last few years, DL is bringing the atten- library into a particular region of a target re- tion of the academic community and big phar- ceptor with elucidated 3D structure. In the first maceutical industries, as a viable alternative to stage of this process, a heuristic search is carried aid in the discovery of new drugs. One of the out in which thousands of possibilities of inser- first work in which DL was successfully applied tions are regarded. In the second, the quality solved problems related to QSAR (Quantita- of the insertion is described via some mathe- tive Structure-Activity Relationships) by Merck matical functions (scoring functions) that give in 2012. Some years later, Dahl et al.18 de- a clue about the energy complementarity be- veloped a multi-task deep neural network to tween compound and target.10,11 The last phase predict biological and chemical properties of a became a challenge to the computational scien- compound directly from its molecular structure. tists considering that it is easier to recover the More recently, Multi-task deep neural networks proper binding mode of a compound within an were employed to foresee the active-site directed active site than to assess a low energy score to pharmacophore and toxicity.19,20 Also in 2015, a determined pose. This hurdle constitutes a Ramsundar et al.21 predicted drug activity us- central problem to the docking methodology.12 ing Massively Multitask Neural Networks asso- Systems based on machine learning (ML) ciated with fingerprints. The relevance of DL have been successfully used to improve the out- is also highlighted by recent applications to the come of Docking-based Virtual Screening for discovery of new drugs and the determination of both, increasing the performance of score func- their characteristics such as Aqueous Solubility tions and constructing binding affinity classi- prediction22 and Fingerprints.23 fiers.1,3 The main strategies used in virtual In this work, we propose an approach based screening are neural networks - NN,13 support on Deep Convolutional Neural Networks to im- vector machines - SVM12 and random forest - prove docking-based virtual screening. The RF.14 One of the main advantages of employing method uses docking simulation results as in- ML is the capacity to explain the non-linear de- put to a Deep Neural Network, DeepVS from pendence of the molecular interactions between now on, which automatically learns to extract ligand and receptor.3 relevant features from basic data such as com- Traditional Machine Learning strategies de- pound atom types, atomic partial charges, and pend on how the data is presented. For ex- the distance between atoms. DeepVS learns ab- ample, in the virtual screening approaches, sci- stract features that are suitable to discriminate entists normally analyze the docking output to between active ligands and decoys in a protein- generate or extract human engineered features. compound complex. To the best of our knowl- 2 edge, this work is the first on using deep learn- convolutional layer is employed to summarize ing to improve docking-based virtual screening. information from all contexts from all atoms Recent works on improving docking-based vir- and generate a distributed vector representa- tual screening have only used traditional shal- tion of the protein-compound complex r (sec- low neural networks with human defined fea- ond hidden layer). Finally, in the last layer tures.7,13,24 (output layer), the representation of the com- We evaluated DeepVS on the Directory of plex is given as input to softmax classifier, Useful Decoys, which contains 40 different re- which is responsible for producing the score. In ceptors. In our experiments we used the output Algorithm 1, we present a high-level pseudo- of two docking programs: AutodockVina1.1.2 code with the steps of the feedforward process and Dock6.6. DeepVS outperformed the dock- executed by DeepVS. ing programs in both AUC ROC and enrich- ment factor. Moreover, when compared to the Atom context results of other systems previously reported in the literature, DeepVS achieved the state-of- First, it is necessary to perform some basic pro- the-art AUC of 0.81. cessing on the protein-compound complex to The main contributions of this work are: extract the input data for DeepVS. The in- (1) Proposition of a new deep learning-based put layer uses information from the context of approach that achieves state-of-the-art perfor- each atom in the compound. The context of an mance on docking-based virtual screening; (2) atom “a” is defined by a set of basic features Introduction of the concept of atom and amino extracted from its neighborhood. This neigh- acid embeddings, which can also be used in borhood comprises the kc atoms in the com- other deep learning applications for computa- pound closest to “a” (including itself) and the tional biology; (3) Proposition of an effective kp atoms in the protein that are closest to “a”, method to create distributed vector represen- where kc and kp are hyperparameters that must tations of protein-compound complexes that be defined by the user. The idea of using in- models the compound as a set of atom con- formation from closest neighbor atoms of both texts that is further processed by a convolu- compound and protein have been successfully tional layer.