
bioRxiv preprint doi: https://doi.org/10.1101/2020.06.09.143289; this version posted June 11, 2020. The copyright holder for this preprint (which was not certified by peer review) is the author/funder, who has granted bioRxiv a license to display the preprint in perpetuity. It is made available under a CC-BY-NC-ND 4.0 International license.

Navigating Chemical Space By Interfacing Generative Artificial Intelligence and Molecular Docking

Ziqiao Xu (a), Orrette Wauchope (b), and Aaron T. Frank (a, c)

(a) University of Michigan, Department of Chemistry, Ann Arbor, 48109, United States of America; (b) City University of New York, Baruch College, Department of Natural Sciences, New York, 10010, United States of America; (c) University of Michigan, Biophysics, Ann Arbor, 48109, United States of America

This manuscript was compiled on June 10, 2020

Here we report the testing and application of a simple, structure-aware framework to design target-specific screening libraries for drug development. Our approach combines advances in generative artificial intelligence (AI) with conventional molecular docking to rapidly explore chemical space conditioned on the unique physicochemical properties of the active site of a biomolecular target. As a proof-of-concept, we used our framework to construct a focused library for cyclin-dependent kinase type-2 (CDK2). We then used it to rapidly generate a library specific to the active site of the main protease (Mpro) of the SARS-CoV-2 virus, which causes COVID-19. By comparing approved and experimental drugs to compounds in our library, we also identified six drugs, namely, Naratriptan, Etryptamine, Panobinostat, Procainamide, Sertraline, and Lidamidine, as possible SARS-CoV-2 Mpro targeting compounds and, as such, potential drug repurposing candidates. To complement the open-science COVID-19 initiatives, we make our SARS-CoV-2 Mpro library fully accessible to the research community (https://github.com/atfrank/SARS-CoV-2).

Virtual Screening Libraries | Biophysical Modeling | Computational Drug Discovery

Developing new drugs is costly and time-consuming; by some estimates, bringing a new drug to market requires ≥13 years at a cost of ≥US$1 billion.(1) As such, there is keen interest in accelerating the drug development pipeline and reducing its cost. In principle, computer-aided drug development can both accelerate and lower the cost of identifying viable drug candidates.(2) For instance, during hit identification, where the goal is to identify small molecules that may bind to the target, one could restrict testing to only the compounds that are predicted to bind to the target with high affinity. However, the computational cost of screening large chemical libraries can still be burdensome. This cost could be reduced by working with smaller and more focused, target-specific virtual libraries, enriched in compounds that are likely to bind to a specific site on the target of interest.(3) Such libraries can be constructed by carrying out de novo design, during which compounds are designed on-the-fly, guided by the unique physicochemical properties of the active site of the target.(4–17) However, because of the difficulty in formalizing diverse reaction rules and implementing robust algorithms to intelligently apply them, such de novo design frameworks have, up to this point, often produced synthetically inaccessible compounds.

Fortunately, advances in generative artificial intelligence now make it possible to design both chemically novel and synthetically feasible compounds in silico.(18–23) Noteworthy among these frameworks are those based on autoencoders. Autoencoders are unsupervised machine learning models that, by virtue of their architecture, are able to learn a compact, latent space representation of chemical space when they are trained with molecular data. The task of designing target-specific libraries can, therefore, be cast as one of exploring the latent space of such models, conditioned on the unique properties of the active site on the target of interest. Currently lacking, however, are efficient, structure-aware strategies for sampling the latent space of generative molecular models. If developed, such techniques could find utility in constructing active-site and target-specific virtual libraries that are likely to contain hit compounds that, if synthesized, might bind to the targeted site. More immediately, they could be used to construct a ligand-based pharmacophore model that can be used to screen millions of compounds in a matter of seconds.(24)

Here, we tested and implemented a novel, structure-based framework for constructing target-specific libraries by sampling the latent space of generative machine learning models conditioned on the


[Figure 1 image. Panel (a): SAMPLE and DOCK steps linking chemical space to the target. Panel (b): pipeline in which a scaffold (benzene) is sampled with the VAE, designs 1 to n are docked against the target, and the designs are scored and filtered to select the best design. Panel (c): density plots of QED (x-axis 0.3 to 1.0, annotated "Increasing drug-likeness") and SAS (x-axis 1 to 5, annotated "Increasing synthetic feasibility"). Panel (d): sampled compounds shown with known CDK2 inhibitors, labeled with Tanimoto coefficients (0.7491, 0.7208, 0.7101, 0.6826, 0.6997, 0.6418, 0.6835, 0.6533), PDB IDs (2R3I, 2VTI, 2EXM, 1VYZ, 1VYW, 2R3N, 2BHH, 2R3P), and DrugBank IDs (DB08534, DB08133, DB08768, DB02647, DB06944, DB08539, DB03583, DB07065).]

Fig. 1. Illustration of (a) the “sample-and-dock” concept that can be used to implement (b) a structure-aware design pipeline. (c) Computed properties of compounds designed to target the active site of CDK2. (d) Representative compounds in our sample-and-dock library that closely resemble known inhibitors of CDK2. Highlighted are substructures that are identical to those found in the most similar compound in our sample-and-dock virtual library.
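The greedy loop illustrated in Fig. 1a and b can be sketched in a few lines of Python. This is a minimal sketch, not the authors' implementation: `sample_neighbors` and `dock_score` are hypothetical callables standing in for the generative model and the docking program, and the toy versions below only manipulate strings. The sketch also assumes that a lower (more negative) score means a more favorable predicted interaction, which matches the negative docking scores reported in Fig. 2.

```python
from typing import Callable, List, Tuple


def sample_and_dock(
    seed: str,
    sample_neighbors: Callable[[str, int], List[str]],
    dock_score: Callable[[str], float],
    n_cycles: int,
    n_samples: int = 20,
) -> List[Tuple[str, float]]:
    """Greedy sample-and-dock: keep the best-scoring design from each cycle."""
    library: List[Tuple[str, float]] = []
    reference = seed
    for _ in range(n_cycles):
        # Sample candidate designs in the "neighborhood" of the current reference.
        candidates = sample_neighbors(reference, n_samples)
        if not candidates:
            break
        # Score every candidate (assumption: lower score = more favorable).
        scored = [(mol, dock_score(mol)) for mol in candidates]
        best = min(scored, key=lambda pair: pair[1])
        library.append(best)
        # The best design becomes the reference for the next cycle.
        reference = best[0]
    return library


def toy_neighbors(reference: str, n_samples: int) -> List[str]:
    """Hypothetical stand-in for encode, perturb, decode: append one atom symbol."""
    return [reference + atom for atom in "CNO"][:n_samples]


def toy_score(mol: str) -> float:
    """Hypothetical stand-in for a docking score: each nitrogen lowers the score."""
    return -float(mol.count("N"))


library = sample_and_dock("c1ccccc1", toy_neighbors, toy_score, n_cycles=3)
print(library)
```

Passing the sampler and the scoring function as arguments mirrors the modularity the text describes: either component can be swapped without touching the loop.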

unique physicochemical properties of the “active site” of a target. We reasoned that we could implement such a pipeline by combining generative molecular machine learning models with computer-based molecular docking algorithms. For instance, one could start by embedding a reference compound, e.g., benzene, into the latent space of a generative autoencoder model and sample latent space points in its “neighborhood.” The sampled points can then be decoded into their corresponding 3-dimensional (3D) representations and then docked onto the target (Figure 1a). Of these sampled compounds, the one that is predicted to interact most favorably with the active site can be identified and then used as the new reference. The process can be repeated (Figure 1b), and in so doing, the latent space of a generative molecular model (and so chemical space) can be explored in a semi-directed manner, conditioned on the properties of the active site of the target. The compounds sampled during this iterative (“sample-and-dock”) process could then be used to construct a target-specific virtual library.

First, we implemented a sample-and-dock pipeline and then used it to construct a target-specific library for the cyclin-dependent kinase 2 (CDK2) protein. We chose CDK2 because it is an important therapeutic target that has been extensively studied biochemically and structurally, and for which a large library of known inhibitors is readily accessible.(25) To achieve this, we first interfaced the junction-tree variational autoencoder (JTVAE)(26) with the docking program, rDock.(27) JTVAE was chosen because it has demonstrated a superior ability to generate synthetically feasible molecules, and rDock was chosen because it is fast and open source. We then applied our JTVAE-rDock sample-and-dock pipeline (Figure 1a and b) to CDK2. For this CDK2 test case, we initiated the sample-and-dock pipeline by first mapping the CDK2 active site using the crystal structure of CDK2 complexed with Roniciclib (PDB: 5IEV).(28) The JTVAE-rDock sample-and-dock pipeline was then initialized using benzene as a starting scaffold. We chose benzene as the starting scaffold because it is small and is a chemical motif that is ubiquitous in known drugs. Each sample-and-dock cycle consisted of sampling and docking 20 unique compounds. For each compound, 100 poses were generated during docking. The compound with the highest predicted binding affinity


[Figure 2 image. Panel (a): predicted poses in the Mpro active site with sub-pockets P-I to P-V labeled. SD-2020-1: docking score -30.66 kcal/mol, drug-likeness 0.51; SD-2020-2: docking score -27.10 kcal/mol, drug-likeness 0.59; SD-2020-3: docking score -28.36 kcal/mol, drug-likeness 0.75. Panel (b): ligand interaction maps with contact distances of 2.55 to 3.18 Å involving His41, His164, Cys145, Ser144, and Leu141. Panel (c): Naratriptan (DrugBank ID: DB00952, Tanimoto: 0.65), Etryptamine (DrugBank ID: DB01546, Tanimoto: 0.64), Panobinostat (DrugBank ID: DB06603, Tanimoto: 0.61), and Lidamidine, Procainamide, Parsalmide, and Sertraline (Tanimoto: 0.52 to 0.56; DrugBank IDs DB01035 and DB01104 shown).]

Fig. 2. (a) Examples of three potential inhibitors (SD-2020-1, -2, and -3) designed using the sample-and-dock approach. Shown are the predicted poses of these compounds, docked into the active site of SARS-CoV-2 Mpro. (b) Ligand interaction maps of the docked poses. Docking predicts that SD-2020-1 and SD-2020-3 interact with the catalytic residue Cys145, and SD-2020-2 interacts with the catalytic residue His41. (c) Known drugs that were most similar to compounds in our sample-and-dock Mpro library. Highlighted are substructures that are identical to those found in the most similar compound in the virtual library.
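The Tanimoto values reported in Fig. 1d and Fig. 2c compare chemical fingerprints. As a minimal sketch of that comparison, the code below treats a fingerprint as the set of its "on" bit indices and returns the size of the intersection divided by the size of the union. The fingerprint type used in this work is not stated, so the bit sets, the drug names, and the helper `most_similar` are hypothetical and for illustration only.

```python
from typing import Dict, FrozenSet, Tuple


def tanimoto(fp_a: FrozenSet[int], fp_b: FrozenSet[int]) -> float:
    """Tanimoto coefficient between two fingerprints given as sets of on-bits."""
    union = len(fp_a | fp_b)
    if union == 0:
        # Two empty fingerprints share nothing; define the similarity as 0.
        return 0.0
    return len(fp_a & fp_b) / union


def most_similar(
    query: FrozenSet[int], references: Dict[str, FrozenSet[int]]
) -> Tuple[str, float]:
    """Return the name and similarity of the reference closest to the query."""
    name = max(references, key=lambda key: tanimoto(query, references[key]))
    return name, tanimoto(query, references[name])


# Hypothetical fingerprints: a designed compound and two known drugs.
design = frozenset({1, 2, 3})
known_drugs = {
    "drug_a": frozenset({2, 3, 4}),
    "drug_b": frozenset({1, 2, 3, 5}),
}

print(tanimoto(design, known_drugs["drug_a"]))  # 2 shared bits / 4 total = 0.5
print(most_similar(design, known_drugs))  # ('drug_b', 0.75)
```

In practice the bit sets would come from a cheminformatics toolkit; the set arithmetic is the same whichever fingerprint is chosen.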

was selected and used as the input scaffold for the next cycle of sample-and-dock.

To explore the quality of the compounds sampled during sample-and-dock, we computed the distribution of their drug-likeness (estimated using the quantitative estimate of drug-likeness, QED, score(29)) and synthetic feasibility (estimated using their synthetic accessibility score, SAS). Though there are exceptions, compounds suitable for initializing drug development typically exhibit QED and SAS values that each approach 1.0. For CDK2, the sample-and-dock molecules exhibited mean QED and SAS values of 0.80 and 2.5, respectively (Figure 1c). To examine whether the compounds explored during sample-and-dock resemble known inhibitors of CDK2, we computed their chemical similarity to known CDK2 inhibitors. Briefly, we identified known CDK2 inhibitors within DrugBank,(25) and calculated the Tanimoto coefficient between the chemical fingerprints of the known inhibitors and the sample-and-dock compounds. Remarkably, despite initializing sample-and-dock with just benzene, we found that we were indeed able to sample designs that closely resemble known CDK2 inhibitors (Figure 1d). Collectively, therefore, our results for CDK2 suggest that, starting from an available structure and knowledge of the binding site, generative machine learning models and molecular docking could be seamlessly interfaced to rapidly construct a target-specific screening library composed of compounds that are predicted to be both drug-like and synthetically feasible and are likely to resemble known small-molecule modulators of the target.

Next, we applied our sample-and-dock framework to the main protease (Mpro) of the virus, SARS-CoV-2. SARS-CoV-2 Mpro processes polypeptides that are important for viral assembly and, as such, is an attractive target for developing drugs to treat COVID-19.(30, 31) To generate a SARS-CoV-2 Mpro-specific library, we initiated sample-and-dock calculations starting with benzene as the reference scaffold and targeting the same site on SARS-CoV-2 Mpro that is occupied by the inhibitor, N3 (PDB: 6Y2F).(32) The active site of SARS-CoV-2 Mpro is composed of five sub-pockets, which we denote as P-I, P-II, P-III, P-IV, and P-V (Figure 2a). Over 48 hours, we generated a virtual library containing 1448 compounds. Shown in Figure 2 are three representative compounds within our SARS-CoV-2 Mpro library. Inspection of the predicted poses indicates that the three representative compounds occupied pockets P-I, P-II, and P-IV (Figure 2a), and could form stabilizing contacts with the catalytic residues, His41 and Cys145 (Figure 2b). Interestingly, both SD-2020-1 (QED=0.51) and SD-2020-2 (QED=0.59) contain the peptidyl moiety found in known Mpro inhibitors.(30, 32) In fact, the presence of the peptidyl moiety was a common feature of many of the compounds that were sampled during sample-and-dock. On the other hand, SD-2020-3 (QED=0.75), a nucleotide analog, is an example of one of the compounds in our library that lacks the peptidyl moiety. The overall drug-likeness, and lack of the reactive peptidyl moiety, make SD-2020-3 an interesting lead candidate. On the basis of the results for CDK2, we speculate that the SARS-CoV-2 Mpro-specific library we generated using our sample-and-dock approach may contain yet-to-be-explored and promising lead candidates. Accordingly, we make the library of Mpro designs fully open to the scientific community with the hope that it will be explored by medicinal chemists to seed the design of novel drugs targeting Mpro of SARS-CoV-2 (https://github.com/atfrank/SARS-CoV-2).

Finally, we asked whether the compounds in our virtual SARS-CoV-2 Mpro library were similar to any known or experimental drugs; identifying such drugs would be one way we could immediately leverage our virtual library to assist ongoing COVID-19 drug repurposing efforts. To answer this question, we compared the compounds in our library with compounds in SuperDRUG2,(33) a library composed of FDA-approved and experimental drugs. Shown in Figure 2c are the six drugs that exhibited the highest chemical similarity to at least one compound in our SARS-CoV-2 Mpro library. Of these six, three are central nervous system drugs, which include Sertraline, a selective serotonin reuptake inhibitor (SSRI). Intriguingly, another SSRI, fluvoxamine, has recently been identified as a potential COVID-19 drug and is now in clinical trials.

To summarize, we combined emerging generative modeling with well-established biophysical modeling to yield a hybrid approach for the structure-guided exploration of chemical space. Instead of exploring chemical space in a large and fixed library, our approach facilitates a directed exploration of the chemical space that is relevant to a specific active site on a given target. The sample-and-dock pipeline we implemented is also modular and flexible. For instance, as more robust generative models are developed, they can be trivially incorporated into our pipeline, as can more accurate and robust scoring functions and methods. In its current form, however, our sample-and-dock pipeline can be deployed to construct structure- and target-specific maps of the “privileged” chemical space for therapeutically relevant protein and nucleic acid targets. Applying unsupervised learning to sets of structure- and target-specific chemical-space maps could facilitate the emergence of general design principles that can guide the rational structure-guided design of


biomolecular probe compounds.(34) Accordingly, we have implemented the sample-and-dock pipeline in an easy-to-use software package, which we refer to as SampleDock, and make it accessible to the research community at https://smaltr.org/.

Supporting Information (SI).

Software. The SampleDock software can be accessed at https://smaltr.org/.

Data sets. The sample-and-dock library for SARS-CoV-2 Mpro (as well as other COVID-19 targets) can be accessed at https://github.com/atfrank/SARS-CoV-2.

Materials and Methods

Sample-and-Dock. To construct structure-aware and target-specific virtual libraries, we implemented a conditional generative sampling scheme. Our conditional generative sampling scheme, which we refer to as sample-and-dock, combines a pre-trained generative variational autoencoder (VAE) model with molecular docking (see below). In sample-and-dock, the exploration of the latent space of a pre-trained generative model is made conditional on the unique physicochemical and biophysical properties of the binding site on a target, $T$, by using a greedy algorithm that optimizes a cost function that depends on the properties of the target. Here we use a molecular docking scoring function as the cost function in our greedy algorithm. Accordingly, given a set of latent-space samples, the “best” is selected as the one that, when decoded into its molecular representation, yields the highest binding affinity ($A_i$), as estimated using a molecular docking scoring function.

Briefly, to construct a customized target-specific library using this conditional sampling scheme, we first embed a reference compound in the latent space of a generative model ($m_i \rightarrow z_i$) and then sample a set of $\{z_{i+1}\}$ in the “neighborhood” of the reference $z_i$. The $\{z_{i+1}\}$ can then be decoded ($z_{i+1} \rightarrow m_{i+1}$, $D$) and the resulting molecules docked into the active site of $T$. Using a greedy optimization strategy, the “best” of these designed molecules can be identified by the docking-derived $A_{i+1}$ and then used as the new reference for the next round of sample-and-dock. The process is then repeated over many cycles to explore the chemical space in a semi-directed manner, conditioned on the properties of $T$.

Implementation. Using a set of in-house Python scripts, we implemented a sample-and-dock pipeline that interfaces the junction-tree variational autoencoder (JTVAE) model of Jin and coworkers(26) with the molecular docking program rDock.(27) All instances of sample-and-dock were initialized using benzene. At the start of each cycle, the seeding molecule was taken in the form of a SMILES string and converted into a one-hot encoding according to a predefined vocabulary. The one-hot encoding was first transformed into a set of two 28-dimensional vectors by the encoders of JTVAE as locations in the two latent spaces. The latent vectors were then re-sampled and reconstructed (decoded to SMILES strings) 20 times by the decoders to generate 20 unique molecules that resemble the seed. For each generated molecule, the SMILES string was converted to a 3D structure using RDKit and then docked into the active site on the target using rDock. Across all the generated molecules within the cycle, the designed molecule with the highest $A_i$, as estimated using the rDock scoring function, was used as the seed molecule for the next cycle. The process is repeated over many cycles, and the best compound from each cycle is used to construct the final library.

Target-Specific Libraries. We used our sample-and-dock pipeline to construct target-specific libraries for CDK2 and SARS-CoV-2 Mpro. For CDK2, sample-and-dock was carried out using the crystal structure of CDK2 bound with Roniciclib (PDB ID: 5IEV).(28) For SARS-CoV-2 Mpro, we used the crystal structure of Mpro bound to N3 (PDB ID: 6Y2F).(32) From the crystal structures, the ligands and receptors were separated and saved as separate coordinate files using PyMOL. To map the active sites of each target, cavity mapping was carried out using the reference-ligand method that is implemented in the rbcavity tool from rDock. For the reference-ligand method, the radius of the overlapping sphere was set to 7.0 Å, and the radius of the small probe was set to 1.5 Å. For both CDK2 and Mpro, sample-and-dock was initialized using benzene and run for 48 hours.

1. SM Paul, et al., How to improve R&D productivity: the pharmaceutical industry’s grand challenge. Nat. reviews Drug discovery 9, 203–214 (2010).
2. H Willems, S De Cesco, F Svensson, Computational chemistry on a budget–supporting drug discovery with limited resources. J. Medicinal Chem. (2020).
3. DE Clark, Virtual screening: Is bigger always better? or can small be beautiful? J. Chem. Inf. Model. (2020).
4. VJ Gillet, A Johnson, P Mata, S Sike, Automated structure design in 3d. Tetrahedron Comput. Methodol. 3, 681–696 (1990).
5. JB Moon, WJ Howe, Computer design of bioactive molecules: A method for receptor-based de novo ligand design. Proteins: Struct. Funct. Bioinforma. 11, 314–328 (1991).
6. V Gillet, AP Johnson, P Mata, S Sike, P Williams, Sprout: a program for structure generation. J. computer-aided molecular design 7, 127–153 (1993).
7. HJ Böhm, The computer program ludi: a new method for the de novo design of enzyme inhibitors. J. computer-aided molecular design 6, 61–78 (1992).
8. SH Rotstein, MA Murcko, Groupbuild: a fragment-based method for de novo drug design. J. medicinal chemistry 36, 1700–1710 (1993).
9. DE Clark, et al., Pro_ligand: An approach to de novo molecular design. 1. application to the design of organic molecules. J. computer-aided molecular design 9, 13–32 (1995).
10. RS DeWitte, EI Shakhnovich, Smog: de novo design method based on simple, fast, and accurate free energy estimates. 1. methodology and supporting evidence. J. Am. Chem. Soc. 118, 11733–11744 (1996).
11. RS DeWitte, AV Ishchenko, EI Shakhnovich, Smog: de novo design method based on simple, fast, and accurate free energy estimates. 2. case studies in molecular design. J. Am. Chem. Soc. 119, 4608–4617 (1997).
12. N Cheron, N Jasty, EI Shakhnovich, Opengrowth: an automated and rational algorithm for finding new protein ligands. J. medicinal chemistry 59, 4171–4188 (2015).
13. L Hoffer, C Muller, P Roche, X Morelli, Chemistry-driven hit-to-lead optimization guided by structure-based approaches. Mol. informatics 37, 1800059 (2018).
14. F Daeyaert, MW Deem, A pareto algorithm for efficient de novo design of multi-functional molecules. Mol. informatics 36, 1600044 (2017).


15. RJ Bienstock, Computational methods for fragment-based ligand design: growing and linking in Fragment-Based Methods in Drug Discovery. (Springer), pp. 119–135 (2015).
16. H Li, KS Leung, CH Chan, HL Cheung, MH Wong, isyn: Webgl-based interactive de novo drug design in 2014 18th International Conference on Information Visualisation. (IEEE), pp. 302–307 (2014).
17. J Diharce, M Cueto, M Beltramo, V Aucagne, P Bonnet, In silico peptide ligation: Iterative residue docking and linking as a new approach to predict protein-peptide interactions. Molecules 24, 1351 (2019).
18. D Merk, L Friedrich, F Grisoni, G Schneider, De novo design of bioactive small molecules by artificial intelligence. Mol. informatics 37, 1700153 (2018).
19. A Gupta, et al., Generative recurrent networks for de novo drug design. Mol. informatics 37, 1700111 (2018).
20. R Gómez-Bombarelli, et al., Automatic chemical design using a data-driven continuous representation of molecules. ACS central science 4, 268–276 (2018).
21. X Liu, K Ye, HW van Vlijmen, AP IJzerman, GJ van Westen, An exploration strategy improves the diversity of de novo ligands using deep reinforcement learning: a case for the adenosine A2A receptor. J. Cheminformatics 11, 35 (2019).
22. G Schneider, DE Clark, Automated de novo drug design–“are we nearly there yet?”. Angewandte Chemie (2019).
23. T Blaschke, M Olivecrona, O Engkvist, J Bajorath, H Chen, Application of generative autoencoder in de novo molecular design. Mol. informatics 37, 1700123 (2018).
24. DR Koes, CJ Camacho, Zincpharmer: pharmacophore search of the zinc database. Nucleic acids research 40, W409–W414 (2012).
25. DS Wishart, et al., Drugbank 5.0: a major update to the drugbank database for 2018. Nucleic acids research 46, D1074–D1082 (2018).
26. W Jin, R Barzilay, T Jaakkola, Junction tree variational autoencoder for molecular graph generation. arXiv preprint arXiv:1802.04364 (2018).
27. S Ruiz-Carmona, et al., rdock: a fast, versatile and open source program for docking ligands to proteins and nucleic acids. PLoS computational biology 10, e1003571 (2014).
28. P Ayaz, et al., Conformational adaption may explain the slow dissociation kinetics of roniciclib (bay 1000394), a type i cdk inhibitor with kinetic selectivity for cdk2 and cdk9. ACS chemical biology 11, 1710–1719 (2016).
29. GR Bickerton, GV Paolini, J Besnard, S Muresan, AL Hopkins, Quantifying the chemical beauty of drugs. Nat. chemistry 4, 90 (2012).
30. K Anand, J Ziebuhr, P Wadhwani, JR Mesters, R Hilgenfeld, Coronavirus main proteinase (3clpro) structure: basis for design of anti-sars drugs. Science 300, 1763–1767 (2003).
31. B Goyal, D Goyal, Targeting the dimerization of main protease of coronaviruses: A potential broad-spectrum therapeutic strategy. ACS Comb. Sci. (2020).
32. L Zhang, et al., Crystal structure of sars-cov-2 main protease provides a basis for design of improved α-ketoamide inhibitors. Science (2020).
33. VB Siramshetty, et al., Superdrug2: a one stop resource for approved/marketed drugs. Nucleic acids research 46, D1137–D1143 (2018).
34. JL Reymond, R Van Deursen, LC Blum, L Ruddigkeit, Chemical space as a source for new drugs. MedChemComm 1, 30–38 (2010).


Supporting Information: Navigating Chemical Space By Interfacing Generative Artificial Intelligence and Molecular Docking

Ziqiao Xu (a), Orrette Wauchope (b), and Aaron T. Frank (a, c)

(a) University of Michigan, Department of Chemistry, Ann Arbor, 48109, United States of America; (b) City University of New York, Baruch College, Department of Chemistry, New York, 10010, United States of America; (c) University of Michigan, Department of Biophysics, Ann Arbor, 48109, United States of America

[Scheme image: steps a to g leading to SD-2020-1. Scheme annotations: "Ref: Eur. J. Med. Chem. (2011), 46, 5769" and "Adapted from Ref: ACS Med. Chem. Lett. (2010), 1, 9, 460".]

Reagents and Conditions: a) benzylcarbonyl chloride, TEA, CH2Cl2, 0 °C; methanesulfonyl chloride, TEA, CH2Cl2, 0 °C; b) NaN3, DMSO, 70 °C, 2h; c) PPh3, MeOH, H2O, reflux, 2h; d) phenyl isocyanate, THF, rt, 2h

Fig. S1. Proposed synthetic scheme for SD-2020-1.


[Scheme image: route to SD-2020-2. Scheme annotation: "Adapted from Heterocycles 2009, 74, 4, 923".]

Fig. S2. Proposed synthetic scheme for SD-2020-2.

[Scheme image: steps a to f. Scheme annotations: "Adapted from Ref: Eur. J. Med. Chem. (2011), 46, 5769" and "Adapted from Ref: ACS Med. Chem. Lett. (2010), 1, 9, 460".]

Reagents and Conditions: a) benzylcarbonyl chloride, TEA, CH2Cl2, 0 °C; methanesulfonyl chloride, TEA, CH2Cl2, 0 °C; b) NaN3, DMSO, 70 °C, 2h; c) PPh3, MeOH, H2O, reflux, 2h; d) methyl isocyanate, THF, rt, 2h

Fig. S3. Proposed synthetic scheme for SD-2020-3.
