Navigating Chemical Space by Interfacing Generative Artificial Intelligence and Molecular Docking

bioRxiv preprint doi: https://doi.org/10.1101/2020.06.09.143289; this version posted June 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. Navigating Chemical Space By Interfacing Generative Artificial Intelligence and Molecular Docking Ziqiao Xua, Orrette Wauchopeb, and Aaron T. Franka,c athe University of Michigan, Department of Chemistry, Ann Arbor, 48109, United States of America; bCity University of New York, Baruch College, Department of Natural Sciences, New York, 10010, United States of America; cthe University of Michigan, Biophysics, Ann Arbor, 48109, United States of America This manuscript was compiled on June 10, 2020 Here we report the testing and application of a simple, structure-aware framework to design target-specific screening libraries for drug development. Our approach combines advances in generative artificial intelligence (AI) with conventional molecular docking to rapidly explore chemical space conditioned on the unique physiochemical properties of the active site of a biomolecular target. As a proof-of- concept, we used our framework to construct a focused library for cyclin-dependent kinase type-2 (CDK2). We then used it to rapidly generate a library specific to the active site of the main protease (Mpro) of the SARS-CoV-2 virus, which causes COVID-19. By comparing approved and experimental drugs to compounds in our library, we also identified six drugs, namely, Naratriptan, Etryptamine, Panobinostat, Procainamide, Sertraline, and Lidamidine, as possible SARS-CoV-2 Mpro targeting compounds and, as such, potential drug repurposing candidates. To complement the open-science COVID-19 drug discovery initiatives, we make our SARS-CoV-2 Mpro library fully accessible to the research community (https://github.com/atfrank/SARS-CoV-2). Virtual Screening Libraries Biophysical Modeling Computational Drug Discovery | | Developing new drugs is costly and time- Fortunately, advances in generative artificial intel- consuming; by some estimates, bringing a new drug ligence now make it possible to design both chemi- to market requires 13 years at the cost of US$1 cally novel and synthetically feasible compounds in ≥ ≥ billion.(1) As such, there is a keen interest in accel- silico.(18–23) Noteworthy among these frameworks erating the drug development pipeline and reduce are those based on autoencoders. Autoencoders its cost. In principle, computer-aided drug devel- are unsupervised machine learning models that, by opment can both accelerate and lower the cost of virtue of their architecture, are able to learn a com- identifying viable drug candidates.(2) For instance, pact, latent space representation of chemical space during hit identification, where the goal is to identify when they are trained with molecular data. The task small molecules that may bind to the target, one of designing target-specific libraries can, therefore, could restrict testing to only the compounds that be cast as one of exploring the latent space of such are predicted to bind to the target with high affinity. models, conditioned on the unique properties of the However, the computational cost of screening large active site on the target of interest. Currently lack- chemical libraries can still be burdensome. The com- ing, however, are efficient, structure-aware strategies putation cost of virtual screening could be reduced for sampling the latent space of generative molec- by working with smaller and more focused, target- ular models. If developed, such techniques could specific virtual libraries, enriched in compounds that find utility in constructing active-site and target- are likely to bind to a specific site on the target of specific virtual libraries that are likely to contain hit interest.(3) Such libraries can be constructed by car- compounds that, if synthesized, might bind to the rying out de novo design, during which compounds targeted site. More immediately, they could be used are designed on-the-fly, guided by the unique physio- to construct a ligand-based pharmacophore model chemical properties of the active site of the target.(4– that can be used to screen millions of compounds in 17) However, because of the difficulty in formalizing a matter of seconds.(24) diverse reaction rules and implementing robust algo- Here, we tested and implemented a novel, rithms to intelligently apply them, up to this point, structure-based framework for constructing target- such de novo design frameworks often produce syn- specific libraries by sampling the latent space of gen- thetically inaccessible compounds. erative machine learning models conditioned on the 1 bioRxiv preprint doi: https://doi.org/10.1101/2020.06.09.143289; this version posted June 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. SAMPLE DOCK O NH O NH O a d O O Chemical NH NH N NH NH Space N N NH N N N NH N b Tanimoto = 0.7491 Tanimoto = 0.7208 Tanimoto = 0.7101 PDB ID: 2R3I PDB ID: 2VTI Scaffold DOCK PDB ID: 2EXM DrugBank ID: DB08534 DrugBank ID: DB08133 (benzene) Design 1 (Target) DrugBank ID: DB08768 Score N N SAMPLE DOCK Filter Best O Design 2 N (VAE) (Target) Design NH NH N N O NH NH O DOCK Design n Tanimoto = 0.6826 (Target) Tanimoto = 0.6997 PDB ID: 1VYZ PDB ID: 1VYW DrugBank ID: DB02647 DrugBank ID: DB06944 4 N c 1.0 O N 0.8 O 3 Increasing Increasing NH NH Drug-likeness 0.6 Synthetic feasibility 2 N N N Density Density N N 0.4 N N N 1 O NH 0.2 0 Tanimoto = 0.6418 0.0 Tanimoto = 0.6835 Tanimoto = 0.6533 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0 1 2 3 4 5 PDB ID: 2R3N PDB ID: 2BHH PDB ID: 2R3P QED SAS DrugBank ID: DB08539 DrugBank ID: DB03583 DrugBank ID: DB07065 Fig. 1. Illustration of (a) the “sample-and-dock” concept that can be used to implement (b) a structure-aware design pipeline. (c) Computed properties of compounds design to target the active site of CDK2. (d) Representative compounds in our sample-and-dock library that closely resemble known inhibitors of CDK2. Highlighted are sub-substructures that are identical to those found in the most similar compound in our sample-and-dock virtual library. unique physiochemical properties of the “active site” We chose CDK2 because it is an important thera- of a target. We reasoned that we could implement peutic target that has been extensively studied bio- such a pipeline by combining generative molecular chemically and structurally, and for which a large machine learning models with molecular computer library of known inhibitors is readily accessible.(25) docking algorithms. For instance, one could start by To achieve this, we first interfaced the junction-tree embedding a reference compound, e.g., benzene, into variational autoencoder (JTVAE)(26)withthedock- the latent space of a generative autoencoder model ing program, rDock.(27) JTVAE was chosen because and sample latent space points in its “neighborhood.” it has demonstrated a superior ability to generate syn- The sampled points can then be decoded into their thetically feasible molecules, and rDock was chosen corresponding 3-dimensional (3D) representations because it is fast, and the software is open-sourced. and then docked onto the target (Figure 1a). Of We then applied our JTVAE-rDock sample-and-dock these sampled compounds, the one that is predicted pipeline (Figure 1a and b) to CDK2. For this CDK2 to interact most favorably with the active site can be test case, we initiated sample-the-dock pipeline by identified and then used as the new reference. The first mapping the CDK2 active site using the crys- process can be repeated (Figure 1b), and in so do- tal structure of CDK2 complexed with Roniciclib ing, the latent-space of a generative molecular model (PDB: 5IEV).(28) The JTVAE-rDock sample-and- (and, so chemical space) explored in a semi-directed dock pipeline was then initialized using benzene as manner, conditioned on the properties of the active a starting scaffold. We chose to use benzene as the site of the target. The compounds sampled during starting scaffold because it is small and a chemi- this iterative (“sample-and-dock”) process could then cal motif that is ubiquitous in known drugs. Each be used to construct a target-specific virtual library. sample-and-dock cycle consisted of sampling and First, we implemented a sample-and-dock pipeline docking 20, unique compounds. For each compound, and then used it to construct a target-specific library 100 poses were generated during docking. The com- for the cyclin-dependent kinase 2 (CDK2) protein. pound with the highest predicted binding affinity 2 | Xu et al. bioRxiv preprint doi: https://doi.org/10.1101/2020.06.09.143289; this version posted June 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under aCC-BY-NC-ND 4.0 International license. a P-III P-III P-III P-II P-II P-II P-IV P-IV P-IV P-V P-V P-V P-I P-I P-I SD-2020-1 SD-2020-2 SD-2020-3 Docking Score: -30.66 kcal/mol Docking Score: -27.10 kcal/mol Docking Score: -28.36 kcal/mol Drug-likesness: 0.51 Drug-likesness: 0.59 Drug-likesness: 0.75 b 2.81 Å His41 2.70 Å 164 3.15 Å His 2.89 Å His164 2.70 Å 2.73 Å 3.08 Å 2.55 Å 2.69 Å 3.14 Å 2.82 Å Cys145 2.68 Å Cys145 3.11 Å 3.18 Å Ser144 Leu141 Leu141 Leu141 Ser144 Ser144 c NH 2 N O NH HN N S H O NH N H O Tanimoto: 0.65 Tanimoto: 0.64 NH Tanimoto: 0.61 HO Naratriptan Etryptamine Panobinostat DrugBank ID: DB00952 DrugBank ID: DB01546 DrugBank ID: DB06603 Cl O O O NH N 2 O N Cl H N N N N H H H N NH 2 H Tanimoto: 0.52 NH 2 Lidamidine Tanimoto: 0.56 Tanimoto: 0.56 Tanimoto: 0.54 Procainamide Parsalmide Sertraline DrugBank ID: DB01035 DrugBank ID: DB01104 Fig.

Load more