
Report on an NIH Workshop on Ultralarge Chemistry
Wendy A. Warr

Wendy Warr & Associates, 6 Berwick Court, Holmes Chapel, Cheshire, CW4 7HZ, United Kingdom.

Email: [email protected]

Introduction

The virtual workshop took place on December 1-3, 2020. It was aimed at researchers, groups, and companies that generate, manage, sell, search, and screen databases of more than one billion small molecules (Figure 1). There were about 550 “attendees” from 37 different countries. Recent advances have enabled researchers to navigate virtual chemical spaces containing billions of chemical structures, carrying out similarity searches, studying structure-activity relationships (SAR), experimenting with scaffold-hopping, and using other methodologies.1 For clarity, one could differentiate “spaces” from “libraries”, and “libraries” from “databases”. Spaces are combinatorially constructed collections of compounds; they are usually very large indeed, and it is not possible to enumerate all the precise chemical structures that are covered. Libraries are enumerated collections of full structures: usually fewer than 10^10 molecules. Databases are a way of storing libraries, for example, in a relational database management system.

Figure 1. Ultralarge chemical databases. (Source: Marcus Gastreich based on the publication by Hoffmann and Gastreich.)

This report summarizes talks from about 30 practitioners in the field of ultralarge collections of molecules. The aim is to represent as accurately as possible the information that was delivered by the speakers; the report does not seek to be evaluative.


Welcoming remarks; defining a drug discovery gateway
Susan Gregurick, Office of Data Science and Strategy, NIH, USA

Data should be “findable, accessible, interoperable and reusable” (FAIR)2 and with this in mind, NIH has been creating, curating, integrating, and querying ultralarge chemistry databases. Ultimately, though, the aim is to compute on data and information, in order to find better, targeted therapeutics. The community already has industrial databases of building blocks, fragments, screening compounds, reagents, intermediates, and synthetic routes. We have algorithms to measure affinity and predict binding, and healthcare records and data on clinical trials. We will be able to collaborate and build platforms based on all this information. In order to develop these networks, we need large scale, computable metadata schema; persistent identifiers; 2D and 3D knowledge graphs and AI; and an ecosystem of high performance computing (HPC) and cloud computing enclaves. There has been a great deal of progress but there is yet more to do in order to refine the drug discovery gateway.

Making virtual REAL: an approach to access billions of make-on-demand compounds
Yurii Moroz, Chemspace LLC, Kyiv, Ukraine

Chemical space is vast: estimates are around 10^63 molecules. A major problem in rational design is that compounds suggested by the software are often hard or impossible to synthesize. Enamine3 is very proud of the synthesis skills and publication record of its synthetic chemists; chemical knowledge is critical to the “make-on-demand” concept. The company has 240 million “MADE” building blocks which can be made on a gram scale, and billions of Readily Accessible (REAL)4 screening compounds “made” by running 195 validated synthetic procedures on 130,000 qualified building blocks.

Validation is a rigorous process, and reagents are scored on Enamine’s experience of how well they work in robust reactions. For example, if a reductive amination works well in 81% of 293 cases where a certain aldehyde is used, then that aldehyde gets a high score. If only 4% of 54 reductive aminations succeed with a certain aldehyde, then that aldehyde will be excluded from construction of the REAL database. The REAL compounds can be made by parallel synthesis on a mg scale using one-pot chemistry in 1-3 steps. Subsets of the REAL database have been made, for example, a subset of 1.36 billion druglike compounds that can be made by Enamine within 3-4 weeks with a success rate of about 80%. Price and delivery time for these compounds can be guaranteed.
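The scoring idea in the paragraph above can be illustrated with a minimal sketch. The cutoffs (`MIN_ATTEMPTS`, `MIN_SUCCESS_RATE`), the reagent names, and the raw success counts are assumptions chosen only to reproduce the quoted percentages; Enamine's actual scoring scheme was not described in this detail.

```python
from dataclasses import dataclass


@dataclass
class ReagentRecord:
    """Outcome history of one building block in one reaction type."""
    name: str        # hypothetical identifier
    successes: int   # reactions that worked
    attempts: int    # reactions tried

    @property
    def success_rate(self) -> float:
        return self.successes / self.attempts if self.attempts else 0.0


# Assumed cutoffs, for illustration only.
MIN_ATTEMPTS = 20
MIN_SUCCESS_RATE = 0.5


def qualifies(record: ReagentRecord) -> bool:
    """Keep a reagent only if it has enough history and a high enough success rate."""
    return (record.attempts >= MIN_ATTEMPTS
            and record.success_rate >= MIN_SUCCESS_RATE)


records = [
    ReagentRecord("aldehyde_A", successes=237, attempts=293),  # about 81%
    ReagentRecord("aldehyde_B", successes=2, attempts=54),     # about 4%
]

qualified = [r.name for r in records if qualifies(r)]
print(qualified)
```

Under these assumed cutoffs, only the first aldehyde would be retained for library construction.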

The REAL space contains 15.5 billion compounds, which are not enumerated. The REAL database of enumerated structures can be searched online in the Chemspace5 catalog (using NextMove’s Arthor software),6 through the Chemspace API, or using KNIME. ChemAxon’s MadFast7 is used to search the REAL database in EnamineStore. REAL use cases have been reported.8-10

The REAL space is too big to enumerate, but it can be similarity-searched1 using BioSolveIT’s Feature Trees (FTrees)11-style similarity search software, facilitating virtual high throughput screening. (FTrees is described in more detail below.) Scaffold-hopping is one of the strengths of FTrees. A recent use case has been reported.12 BioSolveIT’s infiniSee13 software is used to navigate the space. The size of chemical space is tremendous, and Enamine has explored only a small part of it, but has delivered a proven success rate in synthesis.

Searching for novel chemical hit matter in large chemical spaces
Daniel Kuhn, Merck Healthcare KGaA, Darmstadt, Germany

Optimizing small molecule drugs is a multiparameter problem.14 The design, development, and synthesis of drugs has been learned by medicinal and computational chemists and honed after years of training and practice. Nowadays, pharma needs to design better drugs faster, so compound design and structure-activity relationship (SAR) analysis are moving from an art toward a process. Virtual screening in large chemical spaces is increasingly used to identify novel starting points for hit identification. Searching such a huge space to identify which compounds to make next is a big challenge.

Merck AcceSSible InVentory (MASSIV) is Merck’s in-house chemical space of synthetically accessible compounds. It is based on public and in-house chemical reactions in Merck’s electronic lab notebook, an ELN called ELAB, which acts as Merck’s internal knowledge sharing platform. ELAB reactions are classified by InfoChem’s CLASSIFY algorithm.15 The 10^6 building blocks for MASSIV are from eMolecules, Sigma-Aldrich, and Merck’s own collection. Synthesis is carried out using validated reaction spaces resulting from the merger of public and in-house reactions. MiniMASSIV is a subset made by modifying one compound at one site in one reaction.

The MASSIV virtual space of 10^20 molecules is similarity-searched using FTrees.11 Postfiltering is important in hit selection. Application of MASSIV virtual space searches in projects is combined with medicinal chemistry initiatives. Virtual screening, deep learning, and binding activity prediction using free energy perturbation, FEP+,16 have been used in 14 Merck projects. Proof of concept for synthesis was achieved in six cases; synthesis is underway in one case; actives have been found in six other cases.

Merck have learned a number of lessons from applying smart screening rather than hard screening. Ultralarge chemical spaces can provide interesting chemistry as a starting point for hit identification. If you have dedicated parallel chemistry resources you can quickly follow up on the ideas. Outsourcing to CROs (as Merck does) can be slow and expensive. Search in dedicated make-on-demand chemical spaces such as REAL Space is fast and cost-efficient.

Kuhn presented a proof-of-principle for in silico optimization of a fragment to a hit. A scaffold searched in REAL Space gave 903 ideas. These were reduced to 750 by 3D ROCS17 (shape similarity for virtual screening). Docking, followed by MM/GBSA (molecular mechanics with the generalized Born model and solvent accessibility, used to estimate free energies), reduced the 750 to 400. A machine learning model for microsomal clearance reduced the 400 to 250 ideas. Finally, FEP gave eight ideas for which the compounds were ordered from Enamine. They took four weeks to arrive, at a cost of less than 100 euros a compound. Five out of eight have IC50 < 100 µM. Merck has reported broad application of FEP+ across multiple targets and series. Screening of large custom-built libraries is an effective way to provide added value in the projects.18


Boehringer Ingelheim Comprehensive Library of Accessible Innovative Molecules (BICLAIM)
Uta Lessel, Boehringer Ingelheim Pharma GmbH & Co. KG (BI), Biberach, Germany

The traditional approach to de novo drug design consists of fragmenting compounds and then joining up the fragments in artificial compound transformations. This results in huge numbers of compounds, many of which may be impossible to synthesize easily. BI’s goal was to combine de novo design with synthetic accessibility.

BICLAIM represents trillions of virtual compounds. It is impossible to enumerate all those so BioSolveIT software is used: CoLibri19 transforms synthetic knowledge into chemical spaces; FTrees11 is used to search the fragment space for compounds similar to a known active compound. Chemical fragment spaces consist of molecular fragments and corresponding connection rules. The CoLibri reaction synthesizer takes reaction definitions as an input and generates an individual fragment space for every one of them. The CoLibri fragment space merger takes the output of the reaction synthesizer (multiple individual fragment spaces) and merges them.
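To see why a fragment space can represent far more compounds than could ever be enumerated, the following sketch counts the products encoded by a toy space without building any of them. The reaction names, reagent lists, and dictionary layout are hypothetical; they are not CoLibri's data model.

```python
from math import prod

# Hypothetical toy fragment space: each reaction lists the reagent pools
# that may be combined at each of its attachment positions.
fragment_space = {
    "amide_coupling": [
        ["acid_1", "acid_2", "acid_3"],
        ["amine_1", "amine_2"],
    ],
    "suzuki_coupling": [
        ["boronate_1", "boronate_2"],
        ["halide_1", "halide_2", "halide_3", "halide_4"],
    ],
}


def space_size(space: dict) -> int:
    """Count encoded products: the product of pool sizes, summed over reactions."""
    return sum(prod(len(pool) for pool in pools) for pools in space.values())


print(space_size(fragment_space))  # 3*2 + 2*4 products, none enumerated

# Storage grows with the SUM of the pool sizes, the encoded space with their
# PRODUCT: three pools of 10,000 reagents (30,000 stored items) encode 10^12.
print(prod([10_000] * 3))
```

Merging individual fragment spaces, as the fragment space merger is described as doing, corresponds here to simply adding more entries to the dictionary.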

The result of a search is a list of components that are similar to a query, but in addition, the names of the core fragments of these hits also refer to the protocols of how to synthesize these molecules. Therefore, the compounds are accessible through internally known combinatorial synthesis protocols. This also implies that it is easy to synthesize not only a single molecule but also a whole series of analogous compounds. In 2008, BICLAIM had 1600 cores from combinatorial libraries with about 30,000 reagents, encoding about 5 x 10^11 compounds. By 2020 the number of cores had risen to nearly one million.

Feature trees are 2D descriptors so after an FTrees search, 3D post-processing (by shape similarity or docking) is essential; a bioactive conformation is beneficial; and, finally, visual inspection guides library prioritization. After selection of the most appropriate scaffolds, the project team decides on the decoration of the scaffolds, and has the know-how and capacity to synthesize several hundred compounds in a focused combinatorial library. 3D alignments and information about chemical feasibility strongly support prioritization of scaffolds in the project teams.

BICLAIM offers a powerful procedure for the detection of new leads. Nevertheless, careful generation of the fragment space is a huge upfront investment. Exclusion of poorly evaluated chemistry and building blocks with unknown availability is recommended, but these sources might be useful for idea generation in a separate fragment space. The cores taken up should be broadly explored, and project-specific knowledge should be included as far as possible. BICLAIM is applied on a routine basis, before, in parallel with, or instead of high throughput screening, or in the lead optimization stage depending on project needs. Leads for many projects have been successfully provided. Lessel gave an application example: GPR119 agonists.

In summary, FTrees fragment space searches represent a powerful tool for the detection of new leads. Success depends on the quality of the fragment space; combinatorial chemistry know-how; sufficient synthesis and test capacity; matching timelines for virtual screening, synthesis and testing, including one to two follow-up cycles; and, above all, a good team effort by computational, combinatorial, and medicinal chemists.

Build and explore virtual libraries for drug discovery projects in Janssen
Zhijie Liu, Janssen Research & Development, Spring House, PA, USA

The Janssen computational chemistry group has built a proprietary virtual library (JVL) and a proprietary virtual space (JFS) to explore Janssen’s unique chemical space for ongoing projects. JVL is an enumerated library developed by the following process. First, a set of commercial and in-house building blocks was collected and filtered. A total of two million building blocks was reduced to 900,000 by selecting preferred suppliers, ensuring that samples of at least 50 mg were available off-the-shelf commercially (not “make-on-demand”), or in-house, and by applying typical druglikeness filters. Among the 900,000 building blocks, 300,000 were identified as Janssen proprietary building blocks through comparison with commercial and ZINC20 ones. The building blocks were classified with respect to 90 functional groups. Fifty reactions with chemistry intelligence, including reaction scope and limitations, were registered.

The JVL of 100 million molecules from 50 reactions was built using the basis product approach, that is, the simplest basis products are chosen to sample fully the R-group space covered by each component. In a reaction of two reactants, A and B, with 100 building blocks of reactant A and 100 of reactant B, a full combinatorial array is (100 x 100) products. In the basis products approach, the number of products is only (100 + 100). All the compounds in the JVL have at least one Janssen proprietary building block.
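The difference between the two counts can be made concrete with a short sketch. It assumes one plausible reading of the basis product approach: every building block of one reactant is paired with a single “simplest” partner of the other. The reagent names and the choice of the first list entry as the simplest partner are hypothetical.

```python
from itertools import product

reactant_a = [f"A{i}" for i in range(100)]
reactant_b = [f"B{i}" for i in range(100)]

# Full combinatorial array: every A combined with every B.
full_array = list(product(reactant_a, reactant_b))

# Basis products (assumed reading): vary one component at a time while the
# other is held at its simplest representative, taken here as the first entry.
simplest_a, simplest_b = reactant_a[0], reactant_b[0]
basis_products = ([(a, simplest_b) for a in reactant_a]
                  + [(simplest_a, b) for b in reactant_b])
unique_basis = set(basis_products)

print(len(full_array), len(basis_products), len(unique_basis))
```

Under this pairing assumption the pair of the two simplest partners appears in both halves, so the 100 + 100 basis products contain one duplicate.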

The second virtual collection is the much larger JFS, which is stored as a fragment space instead of fully enumerated molecules. Using the same building blocks as before, JFS was built using 120 reactions in CoLibri19 and includes 10^19 molecules. JFS is comparable in size with other virtual databases (Figure 1). As in the case of JVL, all the compounds in the JFS have at least one Janssen proprietary building block. JFS is used in conjunction with FTrees11 (see the earlier talks in this report).

JVL, JFS, and the commercial Enamine REAL4 database have been used to develop type-2 inhibitors for an ongoing kinase project, aiming to obtain high potency, slow off-rate, high kinase selectivity, high solubility, and multiple scaffolds. Initial hits were identified by virtual screening of 740 million Enamine REAL compounds. The method involved similarity search (FastROCS17 and FTrees), Glide docking,21 interaction and property filtering, and clustering and selection. Forty-two compounds were ordered, of which 12% were hits with potency better than 100 nM. Hits were expanded by using substructure search, Glide docking, and property filtering and selection, giving 173 compounds, 25% of which had 100 nM potency. The JVL exploration identified hits with a new scaffold using Glide docking, and filtering by interactions, score and property. The predicted docking model of a novel hit was validated by the X-ray structure. Further exploration with JFS discovered an additional 608 molecules with the aforementioned new scaffold for the kinase project. The comparison of hits from Enamine REAL, GalaXi22 and BioSolveIT Knowledge Space23 shows that JFS covers different chemical space.

Idea2Data: expediting drug discovery through proximal library exploitation
Christos Nicolaou, Eli Lilly and Company, Indianapolis, IN, USA

The drug discovery process is often illustrated in a linear way but the reality is different: hypotheses are tested in a “design, make, test, analyze” (DMTA) cycle.24 Highly integrated functions and processes are needed from start to finish. With this in mind, Lilly created an integrated, globally accessible, automated chemical synthesis laboratory (ASL)25 and an automated purification laboratory. Building on this, Lilly created the Lilly Life Science Studio in San Diego to enable a computationally driven approach to the DMTA cycle. This automated, cloud-based platform consists of 16 autonomous, yet interconnected, automated workstations for functions such as compound and reagent management, synthesis, and purification; and analytical, biological and biophysical testing.

Building on these initiatives, the Proximal Lilly Collection (PLC)26 aims, firstly, to define the chemical space of small, druglike compounds that could be synthesized using in-house resources and, secondly, to facilitate access to compounds in this space for the purposes of drug discovery. In version 1 of PLC, reactants from vendor and Lilly in-house collections were pruned and combined with reactions amenable to automated synthesis. Reactions commonly performed on the ASL were characterized by reagent transformations and filter rules, and reaction logistics. Collections of readily available reactants came from the ASL reactant collection, the Lilly collection, and preferred vendor catalogs. These were pruned by inventory amount, and, optionally, by physicochemical properties etc. The virtual synthesis engine used the reagent pool and a growing number of reactions, eventually 25, to create 5 x 10^11 structures, optionally postprocessed using typical druglikeness criteria, and the Lilly medicinal chemistry rules,27 etc.

Version 2 of the virtual synthesis engine used an annotated reaction repository for which a reaction database and an ontology were developed using NextMove Software’s NameRxn.28 About 2 million reactions were classified into more than 700 reaction types using the ontology implementation. More than 60,000 chemical reactions have been executed in the ASL system by more than 220 researchers.29 The PLC virtual synthesis engine now uses a data-driven annotated reaction definition, driven by machine learning, with reactions expert-reviewed rather than expert-defined. It capitalizes on the availability of clean, robust reaction data in Lilly’s synthetic history (Synthory) database. It is continuously updated, with adaptive learning.

One way to exploit PLC is interactive fragment replacement. A PDB structure is entered into the Molecular Operating Environment (MOE)30 and cleaved. The user selects a PLC reagent and scrolls through designs. The method can be applied to scaffold exploration and addressing selectivity. Other applications are focused library design;26 nearest neighbor search26 for similar compounds in PLC; and virtual screening.

The Idea2Data (I2D) paradigm31 aims for a closed-loop drug discovery platform. It is founded on the tight integration of computational methods for virtual hit identification, and automated synthesis and purification platforms. A success story31 concerns the kinase ULK1. Virtual screening of a PLC-based subset retrieved structural motifs frequently found in kinase binders, and novel structures were identified meeting the required criteria. Twenty-three compounds were synthesized and five actives were identified after biochemical and biophysical screening. In hit expansion, 260 nearest neighbors were found in the PLC subset; 116 were synthesized, 71 were tested, and numerous actives were found, including novel compounds, in three classes.

Faster screening and faster synthesis have been demonstrated, and exploiting proximal spaces could lead to more and better structural ideas. PLC has been in use for a long time and Lilly is confident that libraries such as PLC are a reliable source of chemical structures, including known actives and new structures which are reliably synthesizable. Idea2Data is feasible, but not trivial. The current technology enables rapid design, and automated synthesis, and testing, in distinct, isolated steps. Cross-department collaboration is essential.

To enable more and better structural ideas, PLC must be expanded with more building blocks, and more reactions which are more diverse, and de novo design should be added. Route design could be improved with actionable, ready-to-execute routes. The closed-loop drug discovery paradigm shift is ongoing.

Introduction to DEL informatics and virtual spaces at WuXi AppTec
Jason Deng and John Shirley, WuXi AppTec DEL

DNA-encoded libraries (DELs) are collections of small molecules covalently attached to amplifiable DNA tags carrying unique information about the structure of each library member. A combinatorial approach is used to construct the libraries with iterative DNA encoding steps, facilitating tracking of the synthetic history of the attached compounds by DNA sequencing. DEL libraries are subjected to affinity selection procedures on an immobilized target protein of choice, after which nonbinders are removed by washing steps, and binders can subsequently be amplified by polymerase chain reaction (PCR) and identified by virtue of their DNA code (e.g., by DNA sequencing). Various screening protocols have been developed which allow protein target binders to be selected out of pools containing up to billions of different small molecules.
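The encoding and decoding principle described above can be sketched in a few lines. The four-base tags, the building-block names, and the two-cycle library layout are invented for illustration; real DEL tag designs and sequencing pipelines are considerably more elaborate.

```python
from collections import Counter

# Hypothetical tag tables: one DNA tag per building block in each synthesis cycle.
CYCLE1_TAGS = {"ACGT": "BB1-a", "TGCA": "BB1-b"}
CYCLE2_TAGS = {"GGAA": "BB2-a", "CCTT": "BB2-b"}
TAG_LEN = 4


def decode(read: str):
    """Translate a sequencing read into its building-block history, or None."""
    if len(read) != 2 * TAG_LEN:
        return None
    tag1, tag2 = read[:TAG_LEN], read[TAG_LEN:]
    if tag1 not in CYCLE1_TAGS or tag2 not in CYCLE2_TAGS:
        return None  # unreadable or unknown tag
    return (CYCLE1_TAGS[tag1], CYCLE2_TAGS[tag2])


# Toy reads recovered after affinity selection and PCR amplification.
reads = ["ACGTGGAA", "ACGTGGAA", "TGCACCTT", "ACGTGGAA", "NNNNGGAA"]

counts = Counter(d for d in map(decode, reads) if d is not None)
top_binder, top_count = counts.most_common(1)[0]
print(top_binder, top_count)
```

Library members whose codes are read most often after selection are the candidate binders; the unreadable fifth read is simply discarded.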

Data management and informatics are central to the successful execution of the DEL technology. Jason Deng described WuXi AppTec’s big data management and its application to the DEL platform. WuXi AppTec has made over 200 libraries, ranging in size from 10 million up to billions of compounds (terabytes to petabytes of data). An efficient system manages building blocks, tags, and library design. A highly performant database in the cloud is secure and modular (each customer’s records are isolated). The time taken to export compounds directly with enumerated structures scales linearly; 600 million compounds can be enumerated in about 22 minutes. Query time also scales linearly; 1 million compounds in historical DEL campaign datasets are searched in about 108 minutes. DEL Lake is the front end enabling “fishing in the lake” of DEL compounds.

The WuXi AppTec DEL Selection Package offers more than 50 billion compounds with preselection tests, affinity selection, next generation sequencing (NGS), and data analysis, for all users at a competitive market price. DELight offers 14.1 billion compounds at an economic price. DELopen, with 4.4 billion compounds, is free for academic users. Both DELight and DELopen allow users access to DEL libraries, while providing detailed protocols for users to execute preselection tests and affinity selection in their own facilities. Users can also visit the DEL teaching lab, located in Boston, MA, to perform DEL selections under the guidance of experienced experts.

John Shirley described WuXi AppTec’s two models for chemical spaces: the GalaXi22 space built in collaboration with BioSolveIT and a virtual library containing all enumerated compounds in a SMILES format. Synthetic accessibility and the company’s in-house expertise are critical to both. The GalaXi space is built with CoLibri19 (which ensures high quality processing), and it is navigated with infiniSee.13 (More details of the software were given earlier in this report.) It is based on WuXi AppTec in-stock building blocks and proven reaction schemes. The virtual library is available in enumerated form. Shirley displayed histograms showing the favorable physical property distributions of the virtual library.

There are three phases for both types of library. They vary in size: about 50 million compounds for phase 1, 1.7 billion for phase 2, and 0.6 billion for phase 3. Phase 1 involves one- or two-step chemistry using building blocks from WuXi and trusted suppliers; 128 reaction protocols; and a 75% synthesis success rate for delivery of compounds in 4-6 weeks. Phase 2 is built from about 1000 WuXi templates chosen for novelty and diversity. Usually, one extra reaction step is needed, so the lead time is greater. Phase 3 delivers druglike compounds. It will use three methods, but only one implementation has been fully released to date: 550 million compounds intelligently designed using popular catalog compounds, and reaction schemes as cores for druglike virtual compounds. The second method will be computer-aided drug design and the third will be FDA-approved drugs.

WuXi AppTec has 17 years of experience in library design and synthesis. It has delivered more than 3 million compounds. A dedicated purification team accelerates library production: the typical timeline for focused libraries is 4-8 weeks but lead time can be as short as 8 calendar days to finish a 100-member library. WuXi has diverse materials to facilitate design and expedite production, including more than 25,000 reagents and over 8000 WuXi in-house building blocks in stock.

Exploring GSK space: practical application of large scale virtual screening
Jennifer Elward, GlaxoSmithKline (GSK), Collegeville, PA, USA

SAR to enable scaffold-hopping is difficult and can occur too late to impact a series of hits. Rapid expansion of SAR after initial screening is challenging: it is limited to classic “SAR by catalog”, with small libraries, and variable turnaround times and cost. In addition, about 40% of clinical candidates are often considered difficult to synthesize in the chemical development phase.

So, GSK proposed to expand the available chemical space by using high yielding reactions and diverse building blocks; to enable robust chemistry during discovery (leading to less time in developing a clinical candidate); to search large chemical spaces rapidly; and to build on internal chemistry technologies to synthesize hits. These goals could be accomplished using large-scale virtual screening, with a focus on diversity, speed, and rapid synthesis of hits. Thus, GSK started to collaborate with BioSolveIT, and to use CoLibri19 and FTrees11 which are ideal for generating and searching large chemical spaces quickly.32 FTrees for virtual screening became a new methodology for GSK program teams to rapidly expand chemical diversity and SAR.

In a real life example, a key GSK program was in need of new chemical matter for two viable backup series. No target structure was known and SAR data were limited. Series A comprised about 160 compounds, with limited SAR and potency; series B comprised about 220 compounds with moderate potency. Both series had poor pharmacokinetics (PK).

GSK used FTrees in virtual screening of 10^10 compounds in the Enamine REAL space. The output was triaged and compounds were ordered and tested in key assays to enable a critical go/no-go decision on the transition to lead optimization. Seven active compounds in Series A were selected for searching REAL. The output from FTrees was postprocessed by preferred pharmacophore feature constrained search; a targeted similarity search of >0.8 FTrees similarity; and a high diversity search of 0.5-0.8 FTrees similarity. A Series A chemical space was constructed; about 44,000 unique hits were found. These were filtered by 2D similarity search, ROCS17 3D similarity search, and exclusion of undesirable moieties, leading to 300 compounds for purchase from Enamine. All these were tested in a primary potency assay; selected compounds were tested for in vitro and in vivo potency.
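The two similarity windows used in postprocessing can be expressed as a simple triage function. The compound names and similarity values below are invented, and the treatment of the boundary value 0.8 (assigned to the diversity window) is an assumption; the talk did not specify it.

```python
def triage_bucket(similarity: float) -> str:
    """Assign a hit to a postprocessing bucket by its FTrees similarity to the query."""
    if similarity > 0.8:
        return "targeted"   # close analogues of the query
    if 0.5 <= similarity <= 0.8:
        return "diverse"    # candidates for scaffold-hopping
    return "discard"


# Hypothetical hits with hypothetical similarity values.
hits = {"cpd_1": 0.93, "cpd_2": 0.80, "cpd_3": 0.62, "cpd_4": 0.41}

buckets = {}
for name, sim in hits.items():
    buckets.setdefault(triage_bucket(sim), []).append(name)

print(buckets)
```

Keeping the two windows separate lets a team decide independently how many close analogues and how many more distant, diverse compounds to carry forward.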

The team obtained a hit rate of 5.4% (pIC50 > 4.9) based on the primary potency assay; a set of low in vitro PK compounds (about 40% of the compounds tested); a more diverse chemical space with additional points of SAR to further understanding of the series and build QSAR models for key assay endpoints; and a new methodology for both computational and medicinal chemists to increase diversity and explore chemical space rapidly. PCA visualization of the Series A chemical space based on physicochemical descriptors showed how the new series A explored diverse areas of space not covered by the original one. A number of the active compounds have both high 2D Tanimoto similarity and high feature trees similarity to the query compound but it was not only high similarity of both types that produced active compounds.
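For readers less used to the pIC50 scale, the hit threshold quoted above can be converted to a concentration with the standard definition pIC50 = -log10(IC50 in mol/L). The function names are illustrative only.

```python
import math


def pic50_to_ic50_molar(pic50: float) -> float:
    """Convert pIC50 to IC50 in mol/L."""
    return 10 ** (-pic50)


def ic50_molar_to_pic50(ic50_molar: float) -> float:
    """Convert IC50 in mol/L to pIC50."""
    return -math.log10(ic50_molar)


# The pIC50 > 4.9 criterion corresponds to an IC50 below roughly 12.6 micromolar.
threshold_um = pic50_to_ic50_molar(4.9) * 1e6
print(f"{threshold_um:.1f} uM")
```

On the same scale, a compound with an IC50 of 100 nM has a pIC50 of 7.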

The GSK space originally covered 10^15 compounds in 296 single chemical spaces, from 120 individual reactions. The GSK extra-extra-large space of the future will have 10^26 compounds (hence the figure given in Figure 1).

Screening billions of compounds on the AtomNet Model: approaches and future directions
Venkatesh Mysore, Atomwise, San Francisco, CA, USA

AtomNet33 is a patented34 deep convolutional neural network for predicting molecular binding affinity from the protein-ligand complex. It is a deep learning system that incorporates structural information about the target to make its predictions.

Carefully curated training affinity data from public and proprietary sources inform the classifier and regressor models. Structural data are obtained from the PDB and from homology-modeled targets, with docked poses generated using off-the-shelf tools and in-house models. The convolutional architecture that captures feature locality and hierarchical composition enables the modeling of bioactivity and chemical interactions at the interatomic level. Whereas an image is represented as a 2D grid of pixels containing channels for red, green, and blue colors, the AtomNet model represents a protein-ligand pair as a set of 3D volumetric pixels containing channels for carbon, oxygen, nitrogen, etc. The model autonomously learns the features governing molecular binding, and avoids the manual process of tweaking and over-parameterizing binding features. Model development involves:

• Millions of parameters, hyperparameter tuning, and regularization.
• Multiple cross-folds without data leakage.
• Exploration of architectures, including graph convolutional networks.
• Carefully constructed data sampling protocols for handling imbalance.
• Rigorous testing, bias characterization, and benchmarking.
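The analogy between image pixels and 3D volumetric pixels described above can be illustrated with a minimal voxelization sketch. The grid size, voxel edge length, three-element channel set, and sparse dictionary layout are all assumptions made for brevity; they are not the AtomNet featurization, which was not described at this level of detail.

```python
CHANNELS = ("C", "N", "O")  # assumed element channels, like R, G, B in an image
GRID_SIZE = 8               # voxels per edge (assumed)
VOXEL_EDGE = 1.0            # voxel edge length in angstroms (assumed)


def voxelize(atoms, origin=(0.0, 0.0, 0.0)):
    """Map (element, x, y, z) atoms to a sparse grid {(channel, i, j, k): count}."""
    grid = {}
    for element, x, y, z in atoms:
        if element not in CHANNELS:
            continue  # element has no channel in this toy scheme
        idx = tuple(int((c - o) // VOXEL_EDGE) for c, o in zip((x, y, z), origin))
        if all(0 <= i < GRID_SIZE for i in idx):
            key = (CHANNELS.index(element), *idx)
            grid[key] = grid.get(key, 0) + 1
    return grid


# Hypothetical atoms from a protein-ligand complex (coordinates in angstroms).
atoms = [
    ("C", 1.2, 3.4, 0.5),
    ("O", 1.9, 3.1, 0.2),
    ("N", 7.5, 7.5, 7.5),
    ("C", 9.0, 0.0, 0.0),   # falls outside the 8 x 8 x 8 box
    ("S", 2.0, 2.0, 2.0),   # no sulfur channel here
]
grid = voxelize(atoms)
print(grid)
```

A convolutional network would normally consume a dense tensor of shape (channels, x, y, z); the sparse dictionary is used here only to keep the example free of dependencies.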

Recently, Atomwise has introduced an approach for trawling through billions of compounds using a more efficient model (“trawler”) to predict AtomNet scores. The most promising compounds found by the trawler are then evaluated on AtomNet, and these scores are used to improve the trawler model iteratively. The basic idea is to build a target-specific, ligand-based model for predicting the AtomNet scores, and score only the top predictions of this “proxy” model on the AtomNet model. Iteratively improving the ligand-based model based on the AtomNet scores of previously chosen compounds further improves the accuracy. While sequential or iterative screening has traditionally been performed with a manual or experimental step, the Atomwise approach is an automated algorithm with an active learning component. In 2019, Atomwise applied for a provisional patent for its Ligand-based Explorer implementation of iterative screening.

A fully connected neural network model is built with extended connectivity fingerprints (ECFP4)35 and ligand properties as features and AtomNet scores on the target as labels. The ligand-based method is used first to get an approximate prediction of binding. Exemplars from the library are selected to seed primary scores. A subset of the library is scored using the AtomNet model and target. A target-specific model is trained on the scored subset. Prediction on the entire library using the target-specific model is carried out. A subset of the library with the best predicted scores is selected. The subset is scored, and so on. Iteration is the key to the trawler approach.
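The iterative loop described above can be sketched with toy stand-ins. Here a one-dimensional integer “feature” replaces fingerprints, a nearest-neighbor lookup replaces the fully connected neural network, and `expensive_score` is a hypothetical placeholder for the AtomNet model; only the control flow (seed, train proxy, predict, score the best predictions, repeat) follows the description.

```python
def expensive_score(compound: int) -> int:
    """Hypothetical stand-in for the costly structure-based score (higher is better)."""
    return -(compound - 700) ** 2


def proxy_predict(compound: int, scored: dict) -> int:
    """Cheap proxy: reuse the score of the nearest already-scored compound."""
    nearest = min(scored, key=lambda j: (abs(j - compound), j))
    return scored[nearest]


library = list(range(1000))   # toy library; the integer doubles as the feature
BATCH, ROUNDS = 30, 3

# Seed the primary scores with evenly spaced exemplars from the library.
scored = {c: expensive_score(c) for c in library[::100]}

for _ in range(ROUNDS):
    unscored = [c for c in library if c not in scored]
    # Predict on the whole unscored library with the current proxy.
    ranked = sorted(unscored, key=lambda c: proxy_predict(c, scored), reverse=True)
    # Spend the expensive model only on the best predictions.
    for c in ranked[:BATCH]:
        scored[c] = expensive_score(c)

true_top = sorted(library, key=expensive_score, reverse=True)[:20]
recovered = sum(c in scored for c in true_top) / len(true_top)
print(len(scored), recovered)
```

In this deliberately smooth toy landscape the loop scores 100 of the 1000 compounds with the expensive function; real score landscapes are far rougher, so the recovery seen here should not be read as representative.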

Ligand-based Explorer recovers over 90% of the truly top-scoring compounds despite scoring only 10% of the library on the AtomNet model. A library of 15 billion compounds is traversed in 3-4 iterations in 2-3 days, with only 30 million compounds actually scored on the AtomNet model. The method enables a 20- to 500-fold speed-up and can be parallelized. Scoring by proxy and AtomNet models can be parallelized using dockerized containers through cloud-computing service providers.

AtomNet has been used on libraries of 10-16 billion compounds (Enamine) for 38 targets and 57 screens. In experimentally followed-up projects, two projects found nanomolar hits; six found micromolar hits; five found no hits, and 28 are awaiting results. AtomNet has also been used on smaller libraries such as Mcule ULTIMATE.36 It is being applied in 750 drug discovery projects with nonprofit universities and research institutions with 75% success; 50 projects with pharma have also been carried out.

Future directions for traversal include generative (on-the-fly) library storage (rather than enumerative storage), reaction-based traversal of chemical space, and genetic algorithms and Markov chain Monte Carlo based exploration. In modeling, graph-based fingerprints, proteochemometric models, and a metamodel based on information from multiple screens will be investigated. Potential new applications are multitask models that predict both classifier and regressor scores, direct optimization within a series, iterative analogue-by-catalog, scaffold-hopping, and multi-objective optimization encompassing ADMET properties. Data storage and latency will be improved and memory footprint will be reduced.

Large scale “feature database” to support machine learning in drug discovery Rick Stevens, Argonne National Lab and University of Chicago, USA

At the beginning of the pandemic, Argonne scientists launched an effort to adapt their cancer drug screening pipelines for COVID-19. They wanted to expand the reach of virtual screening to essentially all the publicly available compound datasets. A conventional virtual screening approach was not going to be scalable to the 100 or so models they had created for COVID-19 drug targets: a hybrid AI/HPC approach was needed to give the throughput to dock 400-800 billion compounds virtually.

Argonne chose to use SMILES, Mordred descriptors,37 fingerprints, and 2D images as the base datatypes. These can also support graphs on the fly. Images are interesting since you can use them with convolutional neural networks (CNNs) etc. Argonne’s regression enrichment surfaces38 have been used in evaluating machine learning models in a virtual screening context. Images do better than graphs in the relevant regions of the regression enrichment surface.

The nCov-Group Data Repository39 contains representations and computed descriptors for around 4.2 billion small molecules (60 TB of data). Twenty-three datasets were input (Enamine, PubChem,40 ZINC, SureChEMBL, eMolecules etc.). Each molecule was converted to a canonical SMILES. For each molecule, the system computed about 1800 2D and 3D molecular descriptors using Mordred, molecular fingerprints for similarity searching, and 128x128 pixel, 2D images of the molecular structures. RDKit41 was used to process the SMILES. InChIKeys were also computed. The data processing pipeline used about 2 million core hours on three supercomputers: the Argonne Leadership Computing Facility Theta (a Cray XC40), the Texas Advanced Computing Center Frontera system (Dell EMC, powered by Intel processors), and the Oak Ridge National Laboratory Summit system (an IBM POWER9 processor with NVIDIA GPUs).

Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads (IMPECCABLE)42 is a scalable AI-HPC infrastructure for drug screening (Figure 2). The constituent components are a deep learning based surrogate model for docking (ML1); Autodock-GPU43 for docking (S1); coarse- and fine-grained binding free energy calculations (S3-CG and S3-FG); and DeepDriveMD (S2)44 for protein folding simulations. The ML1 surrogate is a “resnet-50” deep neural network38 that transforms image representations of ligand molecules into a docking score.

Figure 2. AI-HPC infrastructure for drug screening.

It is estimated that 10¹⁰ docking calculations will be needed, along with 10¹² docking-surrogate (machine-learning inference) scores (taking 3.6 hours per 4.2 billion), and thousands of machine learning molecular dynamics calculations over multiple platforms: 5 × 10⁴ binding free energy calculations across machines, in 2.5 × 10⁶ node-hours (equal to about 25 days of 100% of Summit). The estimate for S1 is 1.25 × 10⁶ node-hours on Frontera. The HPC computational infrastructure and capabilities are robust and extensible, operating over multiple heterogeneous resources.
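As a rough check on these estimates, the quoted figures can be turned into wall-clock time and node counts. The sketch below assumes that surrogate inference scales linearly with the number of compounds and that the 3.6-hour figure refers to one fixed allocation of hardware; neither assumption is stated in the talk, and the function names are illustrative.

```python
# Figures quoted in the text above.
SURROGATE_HOURS_PER_BATCH = 3.6     # hours to score one batch with the surrogate
SURROGATE_BATCH_SIZE = 4.2e9        # compounds per batch
SURROGATE_SCORES_NEEDED = 1e12      # total surrogate scores estimated

FREE_ENERGY_NODE_HOURS = 2.5e6      # node-hours for binding free energy calculations
QUOTED_DAYS = 25                    # "about 25 days of 100% of Summit"

def surrogate_hours(n_scores):
    """Wall-clock hours for n_scores, assuming linear scaling (an assumption)."""
    return n_scores / SURROGATE_BATCH_SIZE * SURROGATE_HOURS_PER_BATCH

def implied_nodes(node_hours, days):
    """Number of nodes that must run continuously to spend node_hours in `days`."""
    return node_hours / (days * 24)

print(f"{surrogate_hours(SURROGATE_SCORES_NEEDED):.0f} hours of surrogate inference")
print(f"{implied_nodes(FREE_ENERGY_NODE_HOURS, QUOTED_DAYS):.0f} nodes for {QUOTED_DAYS} days")
```

The second figure is the machine size implied by the quoted node-hours and duration, which is a few thousand nodes.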

The National Virtual Biotechnology Laboratory (NVBL)45 has completed the first rounds of virtual screening and has experimentally screened 1200 compounds on 10 COVID-19 drug targets. Sixty-three hits are in downstream experimental pipelines; 600 additional molecules are entering antiviral assays. The compound database is now being used to support a variety of projects in cancer and machine learning.

Revealing antiviral hits among a billion molecules with a combination of ligand- and target-based approaches Vladimir Poroikov, Institute of Biomedical Chemistry, Moscow, Russia

New drug discovery is based on the analysis of public information about the mechanisms of the disease, molecular targets, and ligands, and the interactions with the target that could lead to the normalization of the pathological process. The available data, and the combinatorics of the associative relations between them, correspond to “Big Data” (Figure 3).46

Figure 3. From molecules to drugs. From data to information to knowledge.

Big data in chemical space have been illustrated in Figure 1. For exploratory research in ultralarge databases, Poroikov and his colleagues have proposed the local correspondence concept. Here, the biological activity of a druglike organic compound is based on molecular recognition between particular atoms of the ligand and the target. Using this concept, the team has developed a system of atom-centered neighborhoods of atom descriptors, including Multilevel Neighborhoods of Atoms (MNA), Quantitative Neighborhoods of Atoms (QNA), and Labeled Multilevel Neighborhoods of Atoms (LMNA).47-49 These descriptors are implemented in Prediction of Activity Spectra for Substances (PASS)50,51 and in several QSAR applications. PASS online has been available since 2000. A recent analysis demonstrated that the performance of its machine learning methods surpassed that of other chemical similarity assessments, particularly in the case of novel repurposed indications.52 A number of applications of General Unrestricted Structure-Activity Relationships (GUSAR) have been reported.48,53-55

To analyze ultralarge chemical databases, Poroikov and his colleagues have applied a complex computational approach, which combines structural similarity assessment, machine learning, and molecular modeling.56 They have reported an approach for the identification of potential pharmacological substances in very large databases of a billion or more druglike compounds. The SAR training set for PASS qualitative predictions is built on data from Cortellis Drug Discovery Intelligence,57 the National Institute of Allergy and Infectious Diseases, PubChem,40 and ChEMBL.58 Predictions from 961 million compounds in the Synthetically Accessible Virtual Inventory (SAVI, described later in this report)59 were done in the cloud. The general workflow56 consists of three stages: (1) chemical similarity assessment (using MNA and QNA), (2) prediction of biological activity using machine learning methods (GUSAR), and (3) visual inspection of the binding poses, and estimation of the scoring function using molecular docking (using ICM-Pro60 on Biowulf, NIH’s supercomputer).

This approach was used in identification of inhibitors of HIV-1 protease and reverse transcriptase, or agonists of TLR and STING, by virtual screening of SAVI. Three potential TLR 7/8 agonists were selected for experimental testing; activity was confirmed in a cell-based assay. These compounds belong to chemical classes in which the agonistic effect on TLR 7/8 had not been previously shown. Synthesis of 36 compounds has been carried out, using SAVI synthetic reactions. The antiretroviral activity of the hits is being studied.

The researchers also carried out virtual screening of more than 1 billion compounds in ZINC, SAVI, Aldrich Market Select, and PASS drugs, to find compounds with anti-SARS-CoV-2 activity, in the Joint European Disruptive Initiative (JEDI)61 Grand Challenge against COVID-19. Forty-eight molecules selected by the team were included in the final list of 1000 molecules and are being synthesized.

Large chemical libraries significantly extend chemical space but the ability to identify new pharmacological agents may be limited by the existing knowledge used as the basis for computational estimations.

The GDB databases and their use for drug discovery Jean-Louis Reymond, University of Bern, Switzerland

Reymond’s team has enumerated all possible molecules following simple rules of chemical stability and synthetic feasibility to form the Generated DataBases (GDB): GDB-11,62 GDB-1363 and GDB-17.64 GDB-17 contained 166.4 billion molecules of up to 17 atoms of C, N, O, S, and halogens. Due to the combinatorial explosion caused by systematic enumeration, GDB-17 is strongly biased toward the largest, functionally and stereochemically most complex molecules and is far too large for most virtual screening tools. A much smaller subset of GDB-17, called the fragment database FDB-17,65 contains 10 million fragment-like molecules evenly covering a broad value range for molecular size, polarity, and stereochemical complexity.

Reymond’s team has also explored the chemical space of all virtually possible organic molecules, focusing on ring systems, which represent the cyclic cores of organic molecules obtained by removing all acyclic bonds and converting all remaining atoms to carbon. The chemical universe database GDB4c66 contains 916,130 ring systems of up to four saturated or aromatic rings and maximum ring size of 14 atoms. GDB4c3D contains the corresponding 6,555,929 stereoisomers.

GDBMedChem67 is a collection of 10 million small molecules constructed by applying rules inspired by medicinal chemistry to exclude problematic functional groups and complex molecules from GDB-17, and sampling the resulting subset uniformly across molecular size, stereochemistry and polarity. GDBChEMBL68 is a subset of GDB-17 featuring 10 million molecules selected according to a ChEMBL-likeness score calculated from the frequency of occurrence of circular substructures in ChEMBL,58 followed by uniform sampling across molecular size, stereocenters, and heteroatoms.

GDB-18 would be 50 times larger than GDB-17 so the researchers studied sampling versus exhaustive enumeration in collaboration with AstraZeneca, in the BigChem project. They performed a benchmark on models trained with subsets of GDB-13 of different sizes with different SMILES variants, with two different recurrent cell types, and with different hyperparameter combinations.69 New metrics were developed that define the generated chemical space with respect to its uniformity, closedness and completeness. The results showed that models that use long short-term memory (LSTM) cells trained with 1 million randomized SMILES are able to generate larger chemical spaces than the other approaches, and they represent more accurately the target chemical space. In collaboration with Novartis, Reymond and co-workers trained an LSTM using molecules from ChEMBL, DrugBank,70 commercially available fragments, or FDB-17 and performed transfer learning to a single known drug to obtain new analogues of this drug.71 They found that this approach generates hundreds of relevant and diverse new drug analogues and works best with training sets of around 40,000 compounds as simple as commercial fragments.

The team has also worked on visualization and searching of GDB. The MQN-mapplet72 is a Java application giving access to the structure of small molecules in large databases via color-coded maps of their chemical space. These maps are projections from a 42-dimensional property space defined by 42 integer value descriptors called molecular quantum numbers (MQN)73 which count different categories of atoms, bonds, polar groups, and topological features, and categorize molecules by size, rigidity, and polarity. An MQN-annotated version of PubChem with an MQN-similarity search tool74 is available on the web.75

The FUn framework consists of client and server modules, facilitating the creation of web-based, interactive 3D visualizations of large datasets.75-77 Nearest neighbor similarity search can be carried out in MQN space.78 GDB has been used for ligand discovery: a virtual screening procedure in the MQN web browser led to three α7-nicotinic receptor ligands.79

MQN works well with ZINC, but for GDB-13 and higher it might be overwhelmed, so finer analysis of GDB databases is needed. MHFP6 extracts the SMILES of all circular substructures around each atom up to a diameter of six bonds and applies the MinHash method to the resulting set, increasing the performance of exact nearest neighbor searches, and enabling the application of locality-sensitive hashing (LSH) approximate nearest neighbor search algorithms.80

Substructure fingerprints perform best for small molecules while atom-pair fingerprints are preferable for large molecules. A new fingerprint called MinHashed atom-pair fingerprint up to a diameter of four bonds (MAP4) is suitable for both small and large molecules.81 In this fingerprint, the circular substructures with radii of r = 1 and r = 2 bonds around each atom in an atom-pair are written as two pairs of SMILES, each pair being combined with the topological distance separating the two central atoms. These so-called atom-pair molecular shingles are hashed, and the resulting set of hashes is MinHashed to form the MAP4 fingerprint.
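The MinHash step shared by MHFP6 and MAP4 can be illustrated with a short, self-contained sketch. It is not the published implementation: the shingle strings below are made up for illustration, real shingles would be generated from molecular graphs with a cheminformatics toolkit, and salting SHA-1 with an integer seed is just one convenient way to obtain a family of hash functions.

```python
import hashlib

def shingle_hash(shingle, seed):
    """Hash one shingle string under the hash function identified by `seed`."""
    digest = hashlib.sha1(f"{seed}:{shingle}".encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big")

def minhash_signature(shingles, n_hashes=256):
    """Keep, for every hash function, the minimum hash value over the set."""
    return [min(shingle_hash(s, seed) for s in shingles) for seed in range(n_hashes)]

def estimated_jaccard(sig_a, sig_b):
    """The fraction of matching positions estimates the Jaccard similarity."""
    matches = sum(a == b for a, b in zip(sig_a, sig_b))
    return matches / len(sig_a)

# Hypothetical shingle sets for two similar molecules (exact Jaccard = 4/6).
mol_a = {"CC", "CCO", "c1ccccc1", "C(=O)O", "CN"}
mol_b = {"CC", "CCO", "c1ccccc1", "C(=O)N", "CN"}

sig_a = minhash_signature(mol_a)
sig_b = minhash_signature(mol_b)
exact = len(mol_a & mol_b) / len(mol_a | mol_b)
print(f"exact Jaccard {exact:.3f}, MinHash estimate {estimated_jaccard(sig_a, sig_b):.3f}")
```

Because the signature has a fixed length regardless of how many shingles a molecule produces, signatures can be compared position by position and indexed with LSH.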

Reymond’s team has also reported a new data visualization method, TMAP, capable of representing datasets of up to millions of data points and arbitrarily high dimensionality as a 2D tree.75,82 TMAP involves (a) LSH forest indexing, (b) construction of a k-nearest neighbor graph, (c) calculation of a minimum spanning tree (MST) of that graph, and (d) generation of a layout for the resulting MST. Visualizations based on TMAP are better suited than t-SNE or UMAP for the exploration and interpretation of large datasets due to their tree-like nature, increased local and global neighborhood and structure preservation, and the transparency of the methods the algorithm is based on. TMAP has been applied to ChEMBL, FDB-17, and other common datasets.
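Steps (b) and (c) can be sketched on a handful of toy fingerprints. This sketch swaps the LSH forest of step (a) for a brute-force neighbor search, which is only feasible for tiny datasets, and it omits the layout of step (d); the fingerprints are invented bit sets, and the function names are illustrative rather than part of the TMAP library.

```python
def jaccard_distance(a, b):
    return 1.0 - len(a & b) / len(a | b)

def knn_graph(fps, k):
    """Brute-force k-nearest-neighbor graph: {(u, v): distance} with u < v."""
    edges = {}
    for u in fps:
        neighbors = sorted((jaccard_distance(fps[u], fps[v]), v) for v in fps if v != u)
        for dist, v in neighbors[:k]:
            edges[tuple(sorted((u, v)))] = dist
    return edges

def minimum_spanning_tree(nodes, edges):
    """Kruskal's algorithm with a simple union-find structure."""
    parent = {n: n for n in nodes}

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]   # path halving
            n = parent[n]
        return n

    tree = []
    for (u, v), dist in sorted(edges.items(), key=lambda e: (e[1], e[0])):
        root_u, root_v = find(u), find(v)
        if root_u != root_v:                # edge joins two components: keep it
            parent[root_u] = root_v
            tree.append((u, v))
    return tree

# Hypothetical fingerprints given as sets of "on" bits.
fps = {
    "A": {1, 2, 3}, "B": {1, 2, 3, 4}, "C": {2, 3, 4, 5},
    "D": {7, 8, 9}, "E": {7, 8, 9, 10}, "F": {3, 4, 8, 9},
}
mst = minimum_spanning_tree(fps, knn_graph(fps, k=2))
print(sorted(mst))
```

If the k-nearest neighbor graph is not connected, Kruskal's algorithm returns a spanning forest rather than a single tree, so the choice of k matters in practice.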

All the GDB databases are available on the web for download and free use, together with interactive visualization and search tools.75 GDBspace83 provides chemical-based companies with easy and reliable orientation within the space of molecules.

Reymond concluded with four examples of the use of GDB. The systematic diversification of known ligands by enumeration with the help of GDB followed by virtual screening, synthesis, and testing has been exemplified by analogues of GLT-1 inhibitors.84 A team other than Reymond’s has synthesized trinorbornane: a new rigid structural type which until then had no real-world counterpart but was found to be present in GDB.85 Exploring the GDBs shows that many chiral, 3D-shaped ring systems, often containing quaternary centers, have never been exploited for drug design. Meier et al.86 have used the enantioselective synthesis of triquinazine to design a nanomolar and selective inhibitor of Janus Kinase 1. Finally, a retrosynthetic accessibility score (RAscore) enables rapid estimation of synthetic feasibility as determined from the predictions of computer-assisted synthesis planning (CASP).87,88

Combinatorial approaches for searching synthetically accessible chemical space Matthias Rarey, University of Hamburg, Germany

FTrees for searching fragment spaces89 is widely used today. Two decades ago, it was developed for and applied to chemical spaces created by shredding, often resulting in recombined molecules which were difficult to synthesize. The classical paradigm of virtual screening (Table 1) now fails due to the sheer number of compounds contained in today’s huge make-on-demand compound libraries.

Table 1. Hit Identification Methods and Chemical Space

The query / Search space  Topological similarity  Maximal common substructure (MCS)  Reduced graphs  Shape   Pharmacophore  Docking
10-1000                   Yes                     Yes                                Yes             Almost  Yes            Borderline
1 million – 100 million   Yes                     Yes                                Yes             Almost  Almost         Borderline
10¹⁰-10⁶⁰                 Yes                     In progress                        Yes             No      No             No

To address this problem, SpaceLight90 uses classic extended connectivity fingerprints (ECFP)35 and connected subgraph fingerprints (CSFP)91 in a new search approach for similarity searching. There are three versions of CSFP: fCSFP for fine-grained similarity measurement, iCSFP, an MCS-like descriptor, and tCSFP with scaffold-hopping potential. SpaceLight takes advantage of the combinatorial nature of chemistry where the mixture of intermediates from reactions between reagent pools A and B in turn reacts with another pool of reagents to give pool C. Newly formed bonds and structural changes in educts are captured in a topology graph.

The topology search algorithm proceeds as follows. First, all ways of partitioning the query compound into connected substructures are determined. The size and topology of the partitions must resemble topology graphs of the topological fragment space. Then there is a matching step enumerating all possible matches of the partition classes of all partitions onto nodes of compatible topology graphs. The partition classes must be similar in size to some fragment in the node and have the same connectivity as the matching node. A topology score measures the similarity of the chemical bond types and the types of chemical bonds connecting the partition classes of the query.

Next comes a comparison step in which the similarity between a connected substructure and all fragments contained in the topology node is calculated for each matched pair of partition class and node of a topology graph. The fragments are ranked based on their similarity to the connected substructure of the query, as measured by the Tanimoto coefficient and ECFP or CSFP fingerprints.

Finally, in a combination step, for each matching in the matching step, the fragment combinations most similar to the matched partition are determined, using the fragment rankings for each node of the matched topology graph derived from the comparison step. As a combination can occur in similar results for multiple matchings, the results are then summarized across all matchings to determine the overall most similar compounds to the query compound in the product space.
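The comparison and combination steps can be sketched for a single matching onto a two-node topology graph. The sketch is illustrative only: the fingerprints are invented bit sets, the partitioning and matching steps are taken as given, and scoring a combination by the mean of its per-node Tanimoto coefficients is an assumption, not SpaceLight's actual scoring.

```python
from itertools import product

def tanimoto(a, b):
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def rank_node(query_part, node_fragments, top_n):
    """Comparison step: rank the fragments of one node against one query part."""
    ranked = sorted(
        node_fragments.items(),
        key=lambda item: (-tanimoto(query_part, item[1]), item[0]),
    )
    return [(name, tanimoto(query_part, bits)) for name, bits in ranked[:top_n]]

def best_combinations(query_parts, nodes, top_n):
    """Combination step: combine only the top-ranked fragments of each node."""
    per_node = [rank_node(part, node, top_n) for part, node in zip(query_parts, nodes)]
    combos = []
    for choice in product(*per_node):
        names = tuple(name for name, _ in choice)
        score = sum(sim for _, sim in choice) / len(choice)   # assumed scoring
        combos.append((score, names))
    combos.sort(key=lambda c: (-c[0], c[1]))
    return combos

# Hypothetical fragment pools (one per topology node) and a partitioned query.
pool_a = {"a1": {1, 2, 3}, "a2": {1, 2}, "a3": {7, 8}}
pool_b = {"b1": {10, 11}, "b2": {10, 11, 12}, "b3": {20}}
query_parts = [{1, 2, 3}, {10, 11, 12}]

combos = best_combinations(query_parts, [pool_a, pool_b], top_n=2)
print(combos[0])
```

Each fragment is compared with the query once per node, so the cost grows with the sum of the pool sizes, while the number of products covered grows with their product.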

The results of the search method are in good agreement with noncombinatorial fingerprint methods using ECFP and CSFP, but retrieved compounds may diverge because, in compound fragmentation, the algorithm captures the locality of features. On the whole, this notion of similarity seems to be beneficial.

SpaceLight is able to search spaces like Enamine REAL, with more than 10 billion compounds, in seconds on a standard desktop computer. In common with FTrees, SpaceLight scales with the number of fragments involved rather than with the number of products, so even chemical spaces of more than 10²⁰ molecules can be searched.

The challenge now lies in 3D. Together with AstraZeneca, Rarey’s team is working on Galileo, involving 3D genetic operators for searching in chemical space. In very recent work with BioSolveIT and Servier, Rarey’s team has analyzed the performance of shape-based directional descriptors, called ray volume matrices (RVMs), in fragment growing, opening the door to fast explorative screening of fragment libraries.92 Fast 3D searching requires precise protein-binding ligand pose optimization: Rarey’s team93 have reported a consistent scheme for pose scoring and gradient-based pose optimization. It consists of a novel variant of the Broyden–Fletcher–Goldfarb–Shanno (BFGS) algorithm, called limited step length (LSL) BFGS. The combination of LSL-BFGS and JAMDAscorer leads to high docking power, high optimization locality, and a low number of score evaluations. There is information on software availability on the web.94,95

Rarey is convinced that combinatorial searching will replace virtual screening sooner or later. With the progress made in the past years with make-on-demand compound libraries, it is the only computational methodology able to cope with these vast collections. Currently Rarey’s lab is working on a comprehensive solution for substructure and MCS search. While algorithms for topological search are at hand today and work is in progress on MCS search, 3D searching of 10¹⁰-10⁶⁰ compounds is a much greater challenge (Table 1).

Virtual screening of ultralarge chemistry databases John Irwin, University of California San Francisco, USA

In docking screens, libraries of about 10⁷ molecules from the ZINC20 database are interrogated for those that complement a protein structure. For each molecule in the library, hundreds of orientations are sampled in the protein binding site, and for each orientation there are typically hundreds of conformations. Overall, 10¹²-10¹³ ligand complexes are calculated in a large library screen, each ranked using scoring functions that consider several polar, nonpolar, and solvent-dependent scoring terms, all approximate. This step takes less than a day for a library of three million compounds.

The researchers then apply a post-processing step that filters out compounds that are chemically similar to known ligands, and they rank and manually inspect and select candidates from the top 0.1% of the library. Since the compounds that Irwin’s team docks are commercially available, they can be rapidly tested to find new leads. This century, there has been an explosion in the number of readily accessible molecules, especially since Enamine’s make-on-demand libraries became popular in 2016.

In one example, Irwin and co-workers carried out structure-based docking of 138 million Enamine REAL make-on-demand compounds against the D4 dopamine receptor. From the top-ranking molecules, 549 compounds were synthesized and tested for interactions with the D4 dopamine receptor. Hit rates plateaued at a high score and then fell almost monotonically to zero with docking score, and a hit-rate versus score curve predicted that the library contained 453,000 ligands, in 72,600 scaffold families. Of 81 new chemotypes discovered, 30 showed submicromolar activity,8 including a 180 pM subtype-selective agonist of the D4 dopamine receptor.

In another example, novel melatonin receptor ligands with new pharmacology were discovered.9 The team docked more than 170 million virtual molecules against melatonin MT1. From the results, 38 high-ranking molecules were synthesized and tested. After structure-based optimization, fifteen novel chemotypes were found, with potencies between 470 pM and 6 μM, two of which were selective MT1 inverse agonists. These showed unexpected in vivo behavior in circadian rhythm. They advanced the phase of the mouse circadian clock by 1.3-1.5 hours when given at subjective dusk, an agonist-like effect that was eliminated in MT1- but not in MT2-knockout mice.

Little is known about the sigma2 receptor. Irwin and his co-workers are using docking to deorphanize this receptor and gain understanding about its role. Large-scale docking of 103 million cations from ZINC1596 against the sigma2 receptor predicted new chemotypes. Of 86 compounds ordered from Enamine on January 3, 2020, 79 were received by February 25, 2020. The 13 best sigma2 molecules from the primary screen have a Ki value against sigma2 from 2.4 nM to 67.8 nM. At 1 µM concentration, 52 of the 79 molecules displaced more than 50% of the radioligand 1,3-Di-(2-tolyl)guanidine (DTG), a high hit rate of 66%. A full hit rate curve may reveal a slope to the upper plateau. The work is not yet published.

ZINC had grown from nearly 9 million compounds in 2015 to over 600 million in 3D by 2020, so a redesign was needed: ZINC2097 uses NextMove Software’s SmallWorld98 for similarity searching, and Arthor6 for pattern and substructure search,99 as well as 3D methods such as docking.

Integrated computational platform for chemistry automation Gergely Zahoranszky-Kohalmi, Samuel G. Michael, G. Sitta Sittampalam, Alexander G. Godfrey National Center for Advancing Translational Sciences (NCATS, NIH), USA

Within the framework of the program called “A Specialized Platform for Innovative Research Exploration” (ASPIRE),100 NCATS has built an integrated computational platform, called “AICP”, which aims to design, synthesize, and screen molecules in an autonomous manner, with the overall goal of accelerating molecular drug discovery. The team has developed a reaction knowledge base and has curated a small-scale reaction dataset extracted from in-house ELNs. The knowledge base can be used for retrosynthesis, reaction prediction, conditions prediction, reaction templating, and hypothesis generation.

NextMove Software’s HazELNut software101 is used to export, normalize and exploit reactions from NCATS ELNs. NextMove Software’s NameRxn28 is used for atom mapping, named reaction identification, and classification. The data are postprocessed in a Python pipeline, implemented as KNIME and custom Palantir102 workflows.

Rancho Biosciences of San Diego curated the data. They encountered many challenges in curating 16,000 reactions: unbalanced reactions; inconsistent use of roles (reactant versus reagent); multistep reactions; lack of systematic capture of failed or successful reactions; and lack of a standardized process description. No clear path could be defined that would satisfactorily improve data input from end-users without a significant rewrite of the ELN application itself. The reactions are stored as a Neo4j graph database and a PostgreSQL (RDKit)41 cartridge, allowing substructure and similarity search and other operations. The scalability of the PostgreSQL cartridge solution is unproven.

Efficient exploration of chemical space is a priority of the ASPIRE program. The ideal chemical space embedding method should allow intuitive interpretation by medicinal chemists: the placement of scaffolds should reflect the medicinal chemist’s thought process and molecules of similar structure and complexity should cluster in the generated map. A generated map should be robust as regards embedding and overlaying new datasets. The method should facilitate the comparison of chemical space coverage across multiple datasets, and should scale to large datasets.

The Hilbert-curve assisted structure embedding method (HCASE)103 uses pseudo-Hilbert curves (PHC) and scaffold keys. Scaffold Keys104 are scaffold descriptors based on simple topological parameters such as number of ring and chain atoms, number and type of heteroatoms, and other simple structural features. They support visualization of large chemistry datasets. Zahoranszky-Kohalmi et al. analyzed the embedding of approved drug molecules and natural products into chemical spaces defined by 55,961 unique Bemis-Murcko scaffolds105 extracted from ChEMBL.58 They demonstrated that the performance of the HCASE method is not only on a par with prior art methods such as t-distributed stochastic neighbor embedding (t-SNE)106 but excels in providing intuitive chemical space embedding. Mapping to the PHC translates to the clustering of reference scaffolds. Resolution (the order of PHC) can be varied, and increasing the resolution leads to convergence in space. Natural products are not well-represented in ChEMBL: they have only 546 scaffolds out of the 55,961.
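The role of the pseudo-Hilbert curve can be illustrated with the standard index-to-coordinate conversion. This is a sketch of the general idea, not the HCASE code: the scaffold names are hypothetical, they are assumed to be already sorted by Scaffold Key, and spreading them evenly along the curve in `embed_ordered` is a simplifying assumption.

```python
def hilbert_d2xy(side, d):
    """Map index d along a Hilbert curve to (x, y) on a side x side grid.

    `side` must be a power of two; this is the standard iterative conversion.
    """
    x = y = 0
    t = d
    s = 1
    while s < side:
        rx = 1 & (t // 2)
        ry = 1 & (t ^ rx)
        if ry == 0:                  # rotate or flip the quadrant
            if rx == 1:
                x = s - 1 - x
                y = s - 1 - y
            x, y = y, x
        x += s * rx
        y += s * ry
        t //= 4
        s *= 2
    return x, y

def embed_ordered(items, order):
    """Place an ordered list of items along a pseudo-Hilbert curve of a given order."""
    side = 2 ** order
    cells = side * side
    if len(items) > cells:
        raise ValueError("curve order too low for the number of items")
    return {item: hilbert_d2xy(side, i * cells // len(items)) for i, item in enumerate(items)}

# Hypothetical reference scaffolds, assumed to be sorted by Scaffold Key.
scaffolds = ["benzene", "pyridine", "naphthalene", "indole", "quinoline", "steroid core"]
layout = embed_ordered(scaffolds, order=3)
for name, xy in layout.items():
    print(name, xy)
```

Consecutive indices along the curve always land in neighboring grid cells, which is why scaffolds that are close in the ordering stay close in the 2D map, and raising the order gives a finer grid.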

ASPIRE requirements are molecular properties (state of matter and solubility); novel tools and integration of existing ones to aid data exploration in an ultralarge ; standardized synthesis protocols; a large dataset of annotated reactions with standardized and machine interpretable reaction mechanism representation, and reaction outcomes; analytical data, and API access. The ELN of the future should scale to support high-throughput chemical synthesis. It should address individual versus blocks of reactions. It should have API access, focus on information integration and visualization, and be storage agnostic.

The art of navigating in chemical bioactivity space Tudor I. Oprea, University of New Mexico Health Sciences Center, USA

Informatics, data science and machine learning can be used with the three pillars of drug discovery: diseases, targets, and drugs. The Illuminating the Druggable Genome (IDG) consortium107 aims to shed light on areas of understudied targets. The Target Central Resource Database (TCRD)108 is the central resource behind the IDG Knowledge Management Center. The multiple data sources in the integrated knowledge base are available through Pharos.108

Ultralarge databases might have an upper limit of 10⁶⁰ unique chemicals but Oprea thinks that the focus should not be on eye-catching numbers but on molecular diversity and druglikeness. Nevertheless, new chemical navigation tools are needed.

In chemical space, as in geography, maps need to be consistent. Oprea’s chemical global positioning system, ChemGPS,109 makes a drug space map by systematically applying conventions when examining chemical space, in a manner similar to the Mercator convention in geography. Chemography is the art of navigating in chemical space. Rules are equivalent to dimensions (e.g., longitude and latitude), while structures are equivalent to objects (e.g., cities and countries). Selected rules include size, lipophilicity, polarizability, charge, flexibility, rigidity, and hydrogen bond capacity. Core structures include most marketed drugs with good oral permeability, as well as other biologically active compounds, while “satellites” are intentionally placed outside the chemical space of drugs, and include molecules having extreme values in one or more of the dimensions of interest. The map coordinates are t-scores extracted by principal component analysis (PCA) from 72 descriptors that evaluate the rules on a total set of 525 satellite and core structures.

Oprea’s team have used ChemGPS molecules in conjunction with the VolSurf descriptors110 relevant for absorption, distribution, metabolism, and excretion (ADME) properties111 in GPSVS. The first GPSVS principal component correlates, with no further training, to passive transcellular permeability, the second to solubility. Although derived from PCA, the two property axes rotate and are no longer orthogonal. GPSVS can be used to map the chemical space with respect to permeability and solubility, as recommended by FDA’s biopharmaceutics classification system. ChemGPS is a heuristic model. Outliers detected via PCA were progressively included in the ChemGPS system.112,113

Do property-based descriptors compute relevant information? To illustrate “property confusion”, Oprea showed four very different drugs with molecular weight between 301.34 and 301.48, five others with logPo/w 1.97, and five more with molecular weight between 283.2 and 285.4 and CLogP between 2.89 and 2.99. Moreover, different chemical structures can have an identical set of descriptors (“descriptor collision”).

To establish how reliable fingerprint technology is in mapping structures, Oprea’s team investigated, with several descriptor systems, structures from the World of Molecular Bioactivity (WOMBAT) medicinal chemistry database and the iResearch Library114 of commercially available chemicals.115 For each database, they extracted all the unique, nonstereoisomeric SMILES, and found 98,575 structures in WOMBAT, and 13,334,014 structures in iResearch Library. WOMBAT had 4.7%-8.8% duplicate structures (depending on the descriptor set) and iResearch Library 5.3%-14.5%.
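A descriptor collision of this kind is easy to reproduce with a deliberately crude descriptor. In the sketch below, `crude_descriptor` simply counts C, N, and O characters in a SMILES string; it is a hypothetical stand-in for the descriptor systems that were actually studied, and it only works for simple SMILES without two-letter element symbols.

```python
from collections import defaultdict

def crude_descriptor(smiles):
    """Toy descriptor: counts of C, N, and O atoms read directly from the SMILES text."""
    text = smiles.upper()
    return (text.count("C"), text.count("N"), text.count("O"))

def duplicate_rate(smiles_list):
    """Fraction of distinct structures whose descriptor was already seen for another one."""
    groups = defaultdict(list)
    for smiles in smiles_list:
        groups[crude_descriptor(smiles)].append(smiles)
    duplicates = sum(len(members) - 1 for members in groups.values())
    return duplicates / len(smiles_list), dict(groups)

# Six distinct structures: 1-propanol, 2-propanol, and methyl ethyl ether share one
# descriptor; ethylamine and dimethylamine share another.
structures = ["CCCO", "CC(C)O", "CCOC", "CCN", "CNC", "CCCC"]
rate, groups = duplicate_rate(structures)
print(f"duplicate rate: {rate:.0%}")
for descriptor, members in groups.items():
    print(descriptor, members)
```

Richer descriptor sets collide far less often than this toy one, but on millions of structures even a small collision rate leaves many molecules that the descriptors cannot tell apart.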

We have reached the stage where “typical” 2D descriptors do not have enough depth to handle the information density of ultralarge chemical databases. We lack truly predictive models for aqueous solubility, for example, and the models, in turn, are limited by the lack of quality data. While the community develops novel navigation tools, the divide-and-conquer approach may offer a temporary solution.

Oprea and his co-workers have developed an algorithm for exhaustive generation of scaffold topologies and an efficient comparison method for graphs.116,117 A topology is a graph in which every node has either 3 or 4 associated edges. The unique characterization of scaffold topologies can be used to identify chemical subspaces and may lead to more efficient ways to query large chemical databases. Current chemical information systems would query only a specific topology.

The known chemical bioactivity space can be mapped. This implies that we can generate models that are useful in small increments, allowing us to “peek” into the unknown (but not necessarily address “affinity cliffs”). There needs to be a balanced choice between good coverage of chemical space and biological space (targets) in order to build reliable models. Despite the constraints of learning from chemical space already mapped, AI and machine learning methods are increasingly reliable in predicting previously unmapped chemical spaces (the “unknown unknown”).

Insilico Medicine have validated several AI-generated small molecules, optimized on a complex multiresponse landscape. Their Chemistry42 suite118 is an automated machine learning platform for drug design capable of finding novel lead-like molecules in a week. Oprea observes that we primarily think of therapeutics as targeting and nucleic acids in drug discovery, with the primary challenge being related to chemical structure optimization, followed by the choice of “right target”, appropriate drug dosage for a specific patient subpopulation, etc.119 Despite major advances in pharmacology, we continue to conduct science using the “silo” approach, often without seeing the complete clinical tableau (disease biology).

What are the limits in mapping chemical space? In the comfort zone of the “known known”, chemical information is well inside the boundary of known medicinal chemistry. AI and machine learning models can extrapolate into the “known unknown”. Beyond that are the limits of knowledge where there is little or no chemical space or target information and beyond that are the unknown unknowns.

SAVI: billions of easily synthesizable compounds generated with expert system rules Marc Nicklaus and Nadya Tarasova, NCI, NIH, USA

The Synthetically Accessible Virtual Inventory (SAVI)120 project aims at computationally generating a very large number of easily and inexpensively synthesizable novel screening compounds for the purpose of drug discovery. To produce SAVI, the project team needed a set of highly predictive and richly annotated rules (“transforms”); a set of reliably available and inexpensive starting materials; and a cheminformatics engine. The transforms used for Logic and Heuristics Applied to Synthetic Analysis (LHASA), first designed for retrosynthetic analysis, are used in SAVI. They are based on the language pair CHMTRN and PATRAN.121 The starting materials used for SAVI were initially from MilliporeSigma, and the cheminformatics engine is CACTVS from Xemistry.122

SAVI-2020 (internally stored in a PostgreSQL database)123 contains 1,748,464,003 compounds (1,526,316,392 of them unique structures). There were about 1000 full downloads (out of 59,456 download accesses) between April and October 2020. The SAVI Plus subset benefits from strict scoring by the LHASA SUBTRACT/KILL rules (do not save the reaction if there is any negative scoring by LHASA transform, just KILL the reaction). About 150 SAVI-2020 products have been synthesized in NIH drug development projects so far, all from the Plus subset; the success rate was about 97%.

The first foray into two-step reactions has begun. This could lead to more than 1 trillion compounds, so selectivity and subsetting are likely to be necessary. Plans for the future are a GUI for fast searches, on a public server; expansion of transforms; broadening of the building block set; easier writing of new transforms; and a more modern way of applying transforms for product generation. CHMTRN is an old unstructured and nonstandardized language working retrosynthetically, which presents challenges in a forward-synthetic context. To overcome these limitations, the team has developed Smarts and Logic In ChEmistry (SLICE), which combines SMARTS with a logic language.124

SAVI can be mined for potential drugs. Some 80-90% of human proteins cannot be targeted by established modalities. Is it really impossible to target cleft-less proteins with small molecules, or have we been searching in a wrong galaxy of the chemical universe? To answer this question, virtual screens have been run using ICM-Pro60 on Biowulf, NIH’s supercomputer. Two rounds of docking are followed by manual verification. Chosen compounds are synthesized in-house or by Enamine and assays are done in-house at NCI.

SAVI was tested in screens for 16 challenging, nondruggable targets and 38 X-ray structures. Out of 70 compounds ordered, 68 were successfully synthesized. A typical workflow involved verification that the X-ray structure was suitable for virtual screens; a docking screen of the SAVI diversity set of 2,955,416 compounds; synthesis and testing of hits; fragment-based and 2D-similarity searching of the entire database for analogues of identified hits; docking of analogues; and structure optimization using traditional medicinal chemistry.

Even for the most challenging targets, the team was able to identify compounds with binding affinity in the micromolar range, or better. SAVI allowed for identification of inhibitors with nanomolar affinity for two targets widely considered nondruggable. Although SAVI and REAL are made from the same building blocks, they appear to perform differently in screens. (The overlap between the two databases is, incidentally, less than 10%). The SAVI transform defines the relative positioning of the fragments in the resulting compound. For example, Sonogashira coupling creates a rigid scaffold with a unique shape. Part of a molecule (e.g., a functional group) generated by a transform can also contribute directly to the binding.

In some cases, however, the team was unable to fill binding pockets completely and effectively, probably because of the limited diversity of the libraries. Ways of improving library diversity are use of more building blocks, more transforms, and more synthesis steps. Using more transforms seems to be the best approach: addition of new chemistries can increase the size and diversity of the databases.

ATOM. Scalable deep learning of generative models for molecular design optimization Jim Brase, Sam Jacobs, Brian Van Essen, Lawrence Livermore National Laboratory (LLNL), California, USA

The Advancing Therapeutic Opportunities for Medicine (ATOM) consortium125 is a public-private partnership with the mission of transforming drug discovery by accelerating the development of more effective therapies. ATOM will establish an open framework for generative molecular design with human-level predictive models and active learning. They are integrating high performance computing, diverse biological data, and emerging to create a new precompetitive platform for drug discovery. R&D program components are open, curated datasets ready for modeling; tools and frameworks for predictive modeling R&D; and an open generative molecular design platform.

Figure 4. ATOM generative framework for molecular design.

The ATOM Modeling PipeLine (AMPL)126,127 for drug discovery has been released and will be extended to multiscale human system models. High-performance multiparameter optimization is in place in the generative molecular design platform loop (gray in Figure 4) and it has been demonstrated and initially validated on an AURK cancer target. About 200 compounds with high potency, selectivity, and other favorable properties were found. Efficacy of generated compounds not in the ATOM database was well predicted (R² for AURK A was 0.68; for AURK B it was 0.75). The active learning loop (yellow in Figure 4) is a work in progress; pilot projects on COVID-19 have been set up with partners.

The generative model is creating a limited range of new molecules because the greedy genetic algorithm converges on a narrow region of chemical space, and the variational autoencoder (VAE) trained on a project-specific compound set can only generate compounds with the same “vocabulary”. Training on a more diverse set will alleviate the VAE issue.

State-of-the-art Junction-Tree-VAEs (JT-VAE) are slow to train. The LLNL team wanted to identify a scalable neural network architecture and explore new generations of character-based sequence models. They encode a compound as a SMILES string; perturb the chemical’s latent representation; decode a new chemical using a generational auto-encoder architecture; and evaluate its similarity to the original compound using Tanimoto similarity.

Training the molecular generator with more compounds should lead to an increase in chemical diversity of proposed solutions. The size of the training dataset was increased to 1.613 billion compounds from Enamine REAL. The targeted chemical compound database “Mpro_inhib” was 1 million additional purchasable compounds screened for SARS-CoV-2 main protease (Mpro) inhibition activity. The test set was a held-out set of 2 million Enamine compounds and 10,000 Mpro_inhib compounds.

The team proposed a new character-Wasserstein Autoencoder (cWAE) that tackles the issue of direct molecular reconstruction required for lead optimization. Training is carried out on 16,640 GPUs on Sierra, Lawrence Livermore’s advanced technology HPC system. The Livermore Big Artificial Neural Network Toolkit (LBANN) enabled training of models at scales previously unobtainable. The novel tournament algorithm, Let a Thousand Flowers Bloom (LTFB), scales up the training of deep neural networks on massive datasets, making use of leadership-class HPC systems.128 cWAE outperforms state-of-the-art Junction-Tree-VAE (JT-VAE). Improved Tanimoto distance ensures that reconstructed compounds have chemical similarity to the original compound. cWAE is faster to train and use for inference than JT-VAE. The team have compared the unique compound generation and diversity from the AURK pilot with the SARS-CoV-2 pilot and showed that the cWAE autoencoder provides improved diversity over the trained chemical space.

The ATOM generative molecular design loop runs in a hybrid cloud-HPC workflow environment. Deep learning at these large scales has exposed unforeseen challenges such as large power swings that require specialized solutions. In this case, timing offsets were inserted to reduce the synchrony of training updates and the peak power required by the GPUs. The next step is to integrate cWAE into the ATOM design loop.

Compression of chemfp databases Andrew Dalke, Dalke Scientific, Trollhättan, Sweden and Brian Cole, D. E. Shaw Research, New York, USA

Chemfp129 is an analytics platform for cheminformatics fingerprints. It includes command-line tools and a Python library for fingerprint generation and high-performance similarity search. The Tanimoto similarity between two bit string fingerprints is a useful mechanism to characterize molecular similarity. Bits are set to 1 or 0 for substructures present or not present in a structure. Chemfp is designed for short dense fingerprints, typically 2048 bits or shorter. The Tanimoto similarity between two bit string fingerprints is computed as:

$$\mathrm{Tanimoto}(fp_1, fp_2) = \frac{\mathrm{popcount}(fp_1 \mathbin{\&} fp_2)}{\mathrm{popcount}(fp_1 \mid fp_2)} \qquad (1)$$

where “&” and “|” denote bitwise binary-and and binary-or, and “popcount()” is the number of 1 bits in the resulting subexpressions, often called the “population count”. One way of speeding up Tanimoto calculations is to precompute the popcount of each target fingerprint. If the query popcount is A and the target popcount is B, and the goal is to find all fingerprints with a similarity threshold T, then Equation (1) can be rewritten with only the intersection popcount:

$$\mathrm{popcount}(fp_1 \mathbin{\&} fp_2) \ge \frac{T(A + B)}{T + 1} \qquad (2)$$

If all fingerprints with a given popcount B are grouped together then the inner loop of the search reduces to the intersection popcount and a comparison to a constant.
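The grouping idea can be illustrated with a minimal Python sketch. This is not chemfp's implementation or API: the function names are hypothetical, fingerprints are held as plain Python integers, and the popcount is computed with string counting rather than a CPU instruction. For each popcount group B, the bound from Equation (2) is computed once, so the inner loop reduces to an intersection popcount and a comparison with a constant.

```python
import math
from collections import defaultdict
from fractions import Fraction


def popcount(fp: int) -> int:
    """Number of 1 bits in a fingerprint stored as a Python integer."""
    return bin(fp).count("1")


def build_index(fingerprints):
    """Group target fingerprints by their precomputed popcount B."""
    index = defaultdict(list)
    for fp in fingerprints:
        index[popcount(fp)].append(fp)
    return index


def threshold_search(query, index, threshold):
    """Return all targets whose Tanimoto similarity to query is >= threshold."""
    t = Fraction(str(threshold))  # exact arithmetic avoids float edge cases
    a = popcount(query)
    hits = []
    for b, group in index.items():
        if a + b == 0:
            continue  # two empty fingerprints: similarity is undefined
        # Equation (2): smallest intersection popcount that can reach T.
        min_common = math.ceil(t * (a + b) / (t + 1))
        for fp in group:
            if popcount(query & fp) >= min_common:
                hits.append(fp)
    return hits
```

A production implementation would replace `popcount` with a hardware instruction and store each group as a contiguous block of memory; the sketch only shows the control flow.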

The fastest popcount implementations use special CPU instructions. The POPCNT instruction on an x86 server is six or seven times faster than a lookup table implementation, and an Intel Advanced Vector Extensions 2 (AVX2) popcount is about 40% faster still. Chemfp can do 130 million 1024-bit fingerprint Tanimoto calculations per second on a single core of a standard x86-64 server machine. When combined with the BitBound pruning algorithm,129,130 a k = 1000 nearest-neighbor search of the 1.8 million 2048-bit Morgan fingerprints of ChEMBL 24 averages 27 ms/query. The same search of 97 million PubChem fingerprints averages 220 ms/query, making chemfp one of the fastest CPU-based similarity search implementations.

The limiting factor to search speed is no longer CPU performance but memory bandwidth and latency. Single-threaded search uses most of the available memory bandwidth of a memory channel. Sorting the fingerprints by popcount improves memory coherency, which when combined with four OpenMP131 threads makes it possible to construct an N × N similarity matrix for 1 million fingerprints in about 30 minutes.

Chemfp natively supports two fingerprint file formats: a text- and line-based exchange format (FPS) which is simple to read and write, and a more complex binary format (FPB) which is faster to read and write.

In future there are plans for improved tools for sharded searches. (Sharding is a type of horizontal partitioning that splits large databases into smaller components, which are faster and easier to manage). There are also plans for Apple M1 CPU support; use of compressed in-memory fingerprints; and, in the long-term, support for sparse and sparse count fingerprints.

Brian Cole presented a client case study on the compression of chemfp databases. Sharding databases is useful to parallelize searches, parallelize generation, and make failure recovery and replication easier. Storage size is the limiting factor. A set of 1.5 billion compounds occupies 452 GB of FPB files. The compounds can be split into shards of 10 million, although sharding does limit the effectiveness of the chemfp search algorithms. The time-to-answer is 150 seconds searching all shards. Fingerprint screening bandwidth is the limiting factor.

Fingerprints should be very compressible. To shrink fingerprint storage, Brian looked at two off-the-shelf solutions: gzip (zlib)132 and Zstandard.133 Zstandard is designed for tunable compression speeds, rapid decompression, and small data; fingerprint generation is “generate once, use many”, so the method is ideal for doing more computation upfront, once. Even though compression is compute-intensive, decompression stays very fast. Storage is reduced to 25 bytes per molecule with Zstandard compression of FPB files, but decompression speed is where Zstandard really shines. Zstandard level 18 compresses to 37 GB at 2.58 MB/second compression and 700 MB/second decompression; gzip-9 compresses to 56 GB at 1.29 MB/second compression and 190 MB/second decompression. Brian tested how both methods fare in end-to-end testing fully parallelized across the cluster (one core per shard). Zstandard-compressed FPB files were better because they did not saturate the network and make other users unhappy.

The heap queue algorithm134 in Python is very useful. Assuming that 16,000 files is the maximum that heapq.merge can handle, 100 million molecules per chemfp FPB file seems to be the current maximum. Multiplying 16,000 by 100 million means 1.6 trillion molecules, occupying 41 TB when compressed by Zstandard. A 60 TB solid-state drive storage cluster can achieve 2500 MB/second search speed, so searching 41 TB at 2500 MB/second would take five hours, a totally acceptable time.
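As a rough illustration, the sketch below shows one way `heapq.merge` could combine per-shard results, followed by the capacity arithmetic quoted above. The talk summary does not say exactly how the merge is used, so the hit format (score, identifier), the assumption that each shard returns hits sorted by descending score, and the function name are all illustrative.

```python
import heapq
from itertools import islice


def merge_shard_hits(shard_hits, k):
    """Merge per-shard hit lists (each sorted by descending score) into a top-k list."""
    merged = heapq.merge(*shard_hits, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, k))


# Back-of-envelope figures taken from the text.
MAX_FILES = 16_000                 # assumed limit for heapq.merge
MOLECULES_PER_FILE = 100_000_000   # per FPB file
BYTES_PER_MOLECULE = 25            # Zstandard-compressed FPB storage

total_molecules = MAX_FILES * MOLECULES_PER_FILE
storage_tb = total_molecules * BYTES_PER_MOLECULE / 1e12
search_hours = 41e12 / 2500e6 / 3600   # 41 TB read at 2500 MB/second
```

With these inputs, `total_molecules` is 1.6 trillion, `storage_tb` comes to about 40 TB (close to the 41 TB quoted), and `search_hours` is a little over four and a half hours, which the text rounds to five.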

In future the researchers plan to use the Zstandard training mode to compress smaller blocks; to work, perhaps, with chemfp indexing to decompress only the necessary blocks; and to not store the index-to-fingerprint lookup file in the FPB file.

Advances in searching ultralarge chemical databases of over 100 billion compounds. Arthor and SmallWorld Roger Sayle, Richard Gowers and John Mayfield, NextMove Software, Cambridge, U.K.

There are two possible strategies for tackling chemical search in ultralarge chemical databases. The first, the evolutionary approach, is to push existing linear scaling algorithms as fast as they will go on modern computer hardware, as exemplified by NextMove Software’s Arthor.6,99 The second, the revolutionary approach, is to develop novel, innovative, significantly sublinear scaling algorithms, as exemplified by NextMove Software’s SmallWorld.98,99 Both these strategies take advantage of technical improvements in computer hardware, particularly storage technology.

A few years ago, 100 million compounds in PubChem could be searched in 25 seconds; 1.4 billion in Enamine REAL took nearly six minutes. Today about 400 million molecules per second can be searched, but scaling is still linear. For nontrivial datasets that do not fit in cache, bandwidth becomes the performance bottleneck. Modern multicore architectures assume a high computation density. On Intel or AMD chips, the number of memory channels and speed of the memory is sometimes more important than the number of cores. Bandwidth limits the value of cores on CPUs and GPUs or of nodes on a cloud.

Most of the time spent in traditional chemical database engines is not actually in the SMARTS matcher, but in representation conversion. Arthor uses a compact binary representation that is ready to search (the same bytes on disk as when processed in memory). Another major advantage of Arthor is efficient encoding (compression) of connection tables in a form that can be processed. Arthor requires 92 bytes on average per molecule; the majority of molecules require only 2 bytes per atom and 2 bytes per bond.

Arthor’s matching optimizations include focusing on addressing “worst case” pathological behavior rather than fingerprint prescreens; advanced symmetry perception and pruning; and SMARTS (and sketch) query optimization. “Search as you draw” becomes possible: response times allow updates as changes are made. Tautomerism and resonance forms cause problems for SMARTS matching. Arthor identifies and filters relevant tautomers and protomers.

Bandwidth limitations also apply when you are computing the Tanimoto similarity of binary fingerprints. Fingerprints can be stored without storing the population count of each fingerprint explicitly, but the biggest trade-off is the length of each fingerprint and this can be tuned. For ultralarge databases, 256 bits are sufficient; for ChEMBL or PubChem, 1024 is reasonable. Even at 256 bits, ECFP4 fingerprints outperform the path-based fingerprints of traditional search engines. A side-effect of progress is that ECFP4 has a significantly lower bit density than path-based fingerprints. Tanimoto values above 0.345 are now biologically significant but path-based fingerprints traditionally have a 0.7 threshold. Unfortunately, bounds-based optimizations129,130 are less effective when using low-density radial fingerprints, such as ECFP, but simpler multithreading and other benefits more than make up for bounds pruning. A search that looks at over 50% of a database is O(N).

The application of just-in-time compilation, the process of turning similarity queries into executable machine code, enables a significant bandwidth breakthrough. Not every word of a fingerprint is necessary for calculating a Tanimoto coefficient: by reading only relevant words, memory bandwidth is used more effectively.
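The "relevant words" idea can be sketched in Python, with a closure standing in for generated machine code. This is not Arthor's implementation: the function name is hypothetical, and the sketch assumes that each target's popcount B is available from elsewhere (for example, from a popcount-sorted layout). Under that assumption, only the target words that line up with nonzero query words need to be read, because the union count follows from A + B minus the intersection count.

```python
def compile_query(query_words):
    """Specialize a Tanimoto function for one query fingerprint.

    query_words is a list of integer machine words. Words that are zero in
    the query cannot contribute to the intersection, so they are skipped.
    """
    active = [(i, w) for i, w in enumerate(query_words) if w]
    a = sum(bin(w).count("1") for _, w in active)  # query popcount A

    def tanimoto(target_words, target_popcount):
        # Read only the target words at positions where the query has bits set.
        c = sum(bin(w & target_words[i]).count("1") for i, w in active)
        union = a + target_popcount - c
        return c / union if union else 0.0

    return tanimoto
```

For a sparse query, `active` is much shorter than the full fingerprint, which is where the bandwidth saving would come from.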

NextMove Software’s analysis of traditional fingerprint search systems revealed that, typically, sorting not searching is the bottleneck. The search phase is O(N), but sorting the results is typically O(N log N) for nontrivial numbers of hits. Arthor uses an efficient O(N) two-pass counting sort. It processes more fingerprints per second than five other “fast search” programs. The federated search, Round-table, allows Arthor databases to be distributed and split across multiple hosts and virtualized as a single database. It eliminates Arthor’s 4-byte row limit.
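A two-pass counting sort of hits can be sketched as follows. The details of Arthor's sort are not given in the talk summary, so this is only a generic illustration: it assumes that scores lie in [0, 1] and can be quantized into a fixed number of buckets (`n_buckets` is a hypothetical parameter), which is what makes a linear-time sort possible. Hits that fall into the same bucket keep their input order.

```python
def counting_sort_hits(hits, n_buckets=1000):
    """Sort (identifier, score) hits by descending score in O(N + n_buckets)."""
    counts = [0] * (n_buckets + 1)
    keys = []
    for _, score in hits:              # pass 1: histogram of quantized scores
        key = int(score * n_buckets)
        keys.append(key)
        counts[key] += 1

    offsets = [0] * (n_buckets + 1)    # first output slot for each bucket
    position = 0
    for key in range(n_buckets, -1, -1):   # highest scores first
        offsets[key] = position
        position += counts[key]

    ordered = [None] * len(hits)
    for hit, key in zip(hits, keys):   # pass 2: place each hit in its slot
        ordered[offsets[key]] = hit
        offsets[key] += 1
    return ordered
```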

Search, however, is only one aspect of ultralarge chemical database management. Other processes include data loading, indexing, and fingerprinting; graph canonicalization; hashing; join performance with external databases; calculation of druglikeness properties; presorting SMILES; and clustering and diversity selection. With pre-indexing (sorting), lookup and duplicate removal go from O(N) to O(log N). Similarly, tautomer, Bemis-Murcko scaffold, and matched pair search can be speeded up by pre-indexing. SmallWorld pre-indexes subgraphs to simplify MCS from nondeterministic polynomial time to about O(1). Graph-edit distance (GED) searches of even very large databases now only require about 10 MB of bandwidth. The spatial data structure requires more than 40 TB but it is computed only once.
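The effect of pre-indexing on exact lookup can be shown with a short sketch using Python's standard `bisect` module. It assumes that structures have already been reduced to canonical strings (for example, canonical SMILES); the function names are illustrative and do not correspond to any NextMove API.

```python
from bisect import bisect_left


def build_sorted_index(keys):
    """Pre-index once: sort the keys and drop duplicates."""
    return sorted(set(keys))


def contains(index, key):
    """Exact lookup by binary search, O(log N) instead of a linear scan."""
    i = bisect_left(index, key)
    return i < len(index) and index[i] == key
```

Sorting costs O(N log N) once, after which every lookup is logarithmic and duplicates sit next to each other, so they can be removed in a single pass.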

Edit distance is a measure of similarity between two discrete mathematical objects. GED is a similarity metric between graphs. It is the minimum number (or cost) of edit operations required to transform one graph into another. SmallWorld indexes the topological space of organic molecules into anonymous graphs. Vertices of the graph are labeled with molecular graphs. Each graph is connected to its neighbors by elementary steps in GED space, such as add a terminal atom, delete a terminal atom, open a ring, close a ring, insert a linker atom, and delete a linker atom.
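Once such an index exists, a nearest-neighbor search amounts to a breadth-first walk outward from the query. The sketch below uses a tiny hand-made graph of carbon skeletons as a stand-in for the precomputed index; the edge list, the labels, and the function name are illustrative assumptions, not SmallWorld data or API.

```python
from collections import deque

# Toy index: each edge is one elementary step (add or delete a terminal atom,
# open or close a ring, insert or delete a linker atom).
EDIT_NEIGHBORS = {
    "CC": ["CCC"],
    "CCC": ["CC", "CCCC", "C1CC1"],
    "CCCC": ["CCC", "C1CCC1", "CC1CC1"],
    "C1CC1": ["CCC", "C1CCC1", "CC1CC1"],
    "C1CCC1": ["CCCC", "C1CC1"],
    "CC1CC1": ["CCCC", "C1CC1"],
}


def neighbors_within(start, max_steps, graph=EDIT_NEIGHBORS):
    """Breadth-first search: map each reachable graph to its number of steps."""
    distance = {start: 0}
    queue = deque([start])
    while queue:
        node = queue.popleft()
        if distance[node] == max_steps:
            continue  # do not expand beyond the requested radius
        for neighbor in graph.get(node, ()):
            if neighbor not in distance:
                distance[neighbor] = distance[node] + 1
                queue.append(neighbor)
    del distance[start]
    return distance
```

The cost of the walk depends on the size of the neighborhood visited rather than on the size of the whole database, which is consistent with the sublinear behavior described for this kind of index.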

The sublinear behavior of SmallWorld’s nearest neighbor calculation makes it faster than fingerprint-based similarity methods for large datasets. In SmallWorld, latency not bandwidth (or computing) becomes the limiting factor.

Efficient 3D exploration of multibillion-compound spaces Christian Lemmen, BioSolveIT, Sankt Augustin, Germany

The search for novel intellectual property in vast chemical spaces that comprise make-on-demand molecules is experiencing a wave of successes. 2D similarity searching has been discussed earlier in this report (in papers by Yurii Moroz, Daniel Kuhn, Uta Lessel, and Matthias Rarey) and other successes have been reported but, as Rarey pointed out, the challenge now lies in 3D. Linearly scaling algorithms are doomed to fail with the exponential growth of these vast libraries. Lemmen disclosed a novel strategy that scales far better and can be used in 3D structure-based design. Chemical Space Docking is based on an anchor-and-grow strategy with reaction-based connections between formalized building blocks, reflecting real chemistry.

First, a building block from the fragment space is placed in the binding site. Building blocks are filtered automatically for unwanted linker positions, low scores, and few interactions, and then manually by , unwanted substructures, and unspecific binding. Libraries are then enumerated with compatible reagents for chosen fragments. In template-based docking, the fragment from step 1 stays in place, an added reagent is flexibly attached, and a second round of filtering and eyeballing is applied to extract the top of the list. Challenges that have been addressed include avoiding futile linker positions, scoring of fragments of differing size, and the efficient processing of sizable enumerated libraries.

An application was docking into SARS-CoV-2 main protease starting from 17 noncovalent crystal structures from the Diamond fragment screening (XChem)135 project. The workflow was as follows:

• docking all REAL Space building blocks
• selection of the best 100 building blocks
• enumeration of 1,731,819 compounds and docking
• scoring 2,292,975 poses
• inspection of the best 50,000 poses
• selection of 13 candidates.136

This approach explores billions of compounds by docking and virtual hits become real, for example through Enamine, saving a great deal of time and money. So far this approach is not available as a software product, but BioSolveIT has carried out five service projects, two of which were successful, and one of which failed; results are pending for the other two. Researchers with an interest in learning how these projects progress are invited to join the Chemical Space Club.137

SciWalker, a novel, comprehensive semantic chemistry search engine for heterogeneous documents and databases Lutz Weber and Christoph Ruttkies, OntoChem, Halle, Germany

SciWalker138 is a toolkit to create semantic search engines and user interfaces and provide structure search on billions of molecules or chemical reactions. Almost 257 million publicly available documents have been indexed from PubMed, clinical trials, patents, drug labels, and grants. The resulting petabytes of data are searchable with Google BigQuery. A commercial database of about 128 million grants, publications, citations, clinical trials and patents (Dimensions)139 has also been indexed. Through normalization and indexing with ontologies from more than 35 knowledge domains, it allows implementation of semantic searches such as “which tetrazoles have been used in clinical trials for which indications?” Chemical structures are automatically identified during the semantic analysis. SciWalker “knows” 35 million chemical concepts such as which chemical names correspond to the class “corticosteroids”.

Federated search of multiple sources is carried out using just one query. Users can search concepts and explore very large domain hierarchies and relationships, and carry out, for example, exact, substructure, and similarity searches of SAVI reactions in a BigQuery table, or of reactions extracted from U.S. patents. Relationships can be identified between concepts such as “aluminum alloy” and related diseases. The system delivers sentences extracted from documents describing the relationship and it links to the source documents. Clever visualizations make it easy to see complex things such as substantial relationships between monoterpenes and cosmetics.

Especially interesting for data scientists are bulk export features, for example, exporting data directly into smaller Excel sheets or into larger Google BigQuery tables for further processing and analytics. Lutz gave three examples of automation of customer specific research processes: continuous biomarker screening for Boehringer Ingelheim; lab journals for another company; and generation of market insights (identification of relationships between a chemical compound and the high-level concept “company”).

Google BigQuery enables interactive dashboards. Complex technologies applied in the background result in a simple dashboard enabling the user to drill down interactively and explore deeper insights. A study showed that out of 303,973 tetrazoles, 35 compounds were in 906 clinical trials. SciWalker is highly modular allowing users to create their own personalized applications.

The SciWalker RESTful web API, programmed in Java, is based on open access software. Installation on personal computers or cloud systems is straightforward and uses docker images. The API uses encrypted, authenticated communications, a React JavaScript GUI, Java 8 middleware, and OpenChemLib,140 CDK141 or ChemAxon chemistry Web Services,142 and different editors. The document database is MySQL. There are connectors to other databases, and third party web APIs, such as Google BigQuery.

Cloud databases and chemical structure searching Wolf-Dietrich Ihlenfeldt, Xemistry, Glashütten, Germany

Generic SQL databases are versatile for storing information but chemists have specific requirements for queries of chemical structure and reaction data, in addition to standard database queries. For decades, database cartridges have been the technological solution of choice. They add chemistry-specific functions to standard databases. This technology works fairly well, and is an industry standard, but there is a problem in that chemistry databases get bigger every year and they have exceeded the storage and maintenance capabilities of local sites. Fortunately, there are vendors such as Google or Amazon which provide hosted cloud SQL databases of nearly unlimited scalability, at a very affordable price, but none of the major hosts allows users to load cartridges into cloud databases. Cartridges are compiled binary code, specific for their target database system or even database system version, not trustworthy, and could wreak havoc onto cloud systems. No cartridge, however, means no chemistry-specific queries such as substructure search.

Xemistry’s scripting environment includes a query processor with specification parsers including SMARTS, QuerySLN, molfile, PATRAN and so on. Query patterns are read into common structure objects, regardless of encoding. Structure and reaction query instructions can be linked via Boolean and other operators to complex query expressions. A query optimizer, and hooks into file I/O module accelerator functions, are integrated. The technology can be applied to single objects, in-memory datasets, plain and accelerated structure and reaction data files, and virtual files.

Can this query capability be extended to work also on external databases: local ones, or those in the cloud, to support universal scripting from single molecule tests to sifting through billions of compounds? An idea is to translate these initially toolkit-specific queries into pure SQL. The difficulty ranges from trivial (identity queries) to rather difficult (sub and superstructure queries). The approach relies heavily on client-side preprocessing and translation of the query into borderline unreadable, complex SQL. You are not expected to write such queries yourself, as you often can with typical cartridge-style queries, where you may insert a few cartridge function calls into generic SQL expressions. Rather, these queries are computer-generated from within the scripting environment. If you are going to use pure SQL, and database tables with chemistry, the data will need to be basic, and the SQL portable.

Substructure search in pure SQL has been done before143 but the published method does not work for real-life queries and database structures, even if formulated in a bond-centric fashion and with structure or substructure segment equivalence preprocessing. A better way is to perform the atom-by-atom matching by an explicitly scripted path. This can be done with temporary, transient tables. The method is as follows. First, collect in a table all the matches of one (preferably exotic) substructure atom on the atoms of the current test structure. Use this table to find every possible match for one of the atoms connected to the first, taking into account the bond attributes of the linking bond; this generates a table of two-atom matches. Repeat with the other atoms connected to the first (yielding a 3, 4, ... column match table) and, when these have all been handled, with the neighbors of the second and subsequent matched atoms that have not yet been treated by neighbor expansion, until all atoms and bonds have been matched. Ring links and disconnected query fragments need special consideration. The structure connectivity is stored in a simple pair of atom and bond tables, not in any standard format (such as SDfile) or preparsed cartridge binary blob.

Common Table Expressions (CTEs) are used to generate a temporary named set, such as a temporary table, that exists for the duration of a query. The proposed substructure query method maps well to SQL CTEs and can be elegantly implemented in a recursive CTE, or more efficiently as a sequence of #ssatom+2 chained simple CTEs. This is a nonrecursive, breadth-first algorithm, which is not optimal if you are only interested in a yes/no match, but does not often cause problems for real-life queries. On the other hand, it can directly provide a count of possible matches, which can be of interest.
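The chained-CTE idea can be sketched with Python's built-in `sqlite3` module. This is an illustration only, not Xemistry's generated SQL: the table layout (`atoms`, `bonds`), the column names, the three-molecule demo data, and the helper functions are all assumptions. The sketch emits one CTE per query bond plus a seed CTE, starts from query atom 0 rather than choosing the most exotic atom, stores each bond in both directions, and handles ring closures but not disconnected fragments.

```python
import sqlite3


def build_demo_db():
    """Create a toy database with separate atom and bond tables."""
    con = sqlite3.connect(":memory:")
    con.executescript(
        "CREATE TABLE atoms (mol INTEGER, idx INTEGER, elem TEXT);"
        "CREATE TABLE bonds (mol INTEGER, a1 INTEGER, a2 INTEGER, bo INTEGER);"
    )
    molecules = {  # heavy atoms only: (elements, [(atom, atom, bond order)])
        1: (["C", "C", "O", "O"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)]),  # acetic acid
        2: (["C", "C", "O"], [(0, 1, 1), (1, 2, 1)]),                  # ethanol
        3: (["C", "C", "O", "C"], [(0, 1, 1), (1, 2, 2), (1, 3, 1)]),  # acetone
    }
    for mol, (elems, bonds) in molecules.items():
        con.executemany("INSERT INTO atoms VALUES (?, ?, ?)",
                        [(mol, i, e) for i, e in enumerate(elems)])
        for a, b, bo in bonds:  # store each bond in both directions
            con.executemany("INSERT INTO bonds VALUES (?, ?, ?, ?)",
                            [(mol, a, b, bo), (mol, b, a, bo)])
    return con


def substructure_sql(q_atoms, q_bonds):
    """Build chained CTEs; each q_bond is (matched atom, other atom, bond order)."""
    matched = [0]
    params = [q_atoms[0]]
    ctes = ["s0 AS (SELECT mol, idx AS q0 FROM atoms WHERE elem = ?)"]
    for step, (p, c, order) in enumerate(q_bonds, start=1):
        prev = f"s{step - 1}"
        keep = ", ".join(f"{prev}.q{i} AS q{i}" for i in matched)
        join_bond = (f"JOIN bonds bd ON bd.mol = {prev}.mol "
                     f"AND bd.a1 = {prev}.q{p} AND bd.bo = ?")
        params.append(order)
        if c in matched:  # ring closure: both atoms are already mapped
            ctes.append(f"s{step} AS (SELECT {prev}.mol AS mol, {keep} "
                        f"FROM {prev} {join_bond} AND bd.a2 = {prev}.q{c})")
        else:  # neighbor expansion: add one column to the match table
            unused = "".join(f" AND bd.a2 <> {prev}.q{i}" for i in matched)
            ctes.append(f"s{step} AS (SELECT {prev}.mol AS mol, {keep}, bd.a2 AS q{c} "
                        f"FROM {prev} {join_bond}{unused} "
                        "JOIN atoms t ON t.mol = bd.mol AND t.idx = bd.a2 AND t.elem = ?)")
            params.append(q_atoms[c])
            matched.append(c)
    sql = ("WITH " + ", ".join(ctes) +
           f" SELECT mol, COUNT(*) FROM s{len(q_bonds)} GROUP BY mol ORDER BY mol")
    return sql, params


con = build_demo_db()
# Query C(=O)O: atom 0 is C, double-bonded to O (atom 1), single-bonded to O (atom 2).
sql, params = substructure_sql(["C", "O", "O"], [(0, 1, 2), (0, 2, 1)])
matches = con.execute(sql, params).fetchall()  # rows of (molecule id, match count)
```

Because the final query groups by molecule and counts rows, it returns the number of distinct atom mappings per structure, which corresponds to the match count mentioned above; symmetric queries are counted once per automorphism.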

Xemistry have demonstrated that many typical pure SQL substructure queries on a 1.8 million compound test dataset can be done on a workstation PC with a standard, unsharded, single-threaded database (like a MySQL or PostgreSQL instance) in less than a second. In some cases, this can actually be faster than a cartridge. The toolkit query optimizer merges the screens of multiple substructures into a single, much more selective screen. That is an optimization which is not accessible to the SQL query optimizer which does not know about screen relationships between different calls into typical cartridge SQL functions.

Screening performance is almost identical to the cartridge version. Single-thread atom-by-atom match performance on a recent workstation CPU was 2500-10,000 complete substructure match attempts per second. Performance loss factor versus a cartridge search was 1-15, depending on the query. For interactive queries (with reasonable but not excellent screening efficiency, where you do not want to see more than 10,000 hits, and where you need an answer in 30 seconds) this translates into an approximate limit of 5-20 million compounds per table shard and thread. For scripted queries where you are willing to wait a few minutes, but not hours, table shards of up to 100 million compounds are reasonable.
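The link between the match rate and the shard limit can be checked with a little arithmetic. The sketch below is a back-of-envelope reading, not a calculation from the talk: it assumes that the slower match rate pairs with the smaller shard limit, that the whole 30 seconds is spent on atom-by-atom matching, and that the time taken by the screen itself is negligible.

```python
# Figures quoted in the text.
MATCH_RATES_PER_S = (2_500, 10_000)       # atom-by-atom match attempts per second
TIME_BUDGET_S = 30                        # interactive response time
SHARD_LIMITS = (5_000_000, 20_000_000)    # compounds per table shard and thread


def implied_pass_fraction(rate_per_s, shard_size, budget_s=TIME_BUDGET_S):
    """Fraction of a shard that can be matched atom by atom within the time budget."""
    return rate_per_s * budget_s / shard_size


# Assumption: the low rate goes with the small shard, the high rate with the large one.
fractions = [
    implied_pass_fraction(rate, shard)
    for rate, shard in zip(MATCH_RATES_PER_S, SHARD_LIMITS)
]

for rate, shard, fraction in zip(MATCH_RATES_PER_S, SHARD_LIMITS, fractions):
    attempts = rate * TIME_BUDGET_S
    print(f"{rate:>6} attempts/s -> {attempts:>7} attempts in {TIME_BUDGET_S} s "
          f"= {fraction:.1%} of a {shard:,} compound shard")
```

Under these assumptions both ends of the range imply a screen that passes about 1.5% of the compounds on to atom-by-atom matching, which is one way to read "reasonable but not excellent screening efficiency".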

To date, the methodology has been implemented for full structure with or without stereochemistry, isotopes, or tautomers; substructure (a simple hit check); superstructure; formula (extended ConQuest syntax); similarity (not just Tanimoto); and nearly complete SMARTS support (and, indirectly, support for other query specification formats). Functionality that is still missing includes stereochemistry matching; recursive SMARTS; substructure match mode manipulation; reaction queries; and R-group variability.

Chemical space is infinite. How can one scale to infinity while still retaining usability and usefulness? Evan Bolton, NLM, NIH, USA

Databases of over 100 million synthesizable (“make on demand”) chemical structures are increasingly available from chemical vendors. In addition, extremely large databases of enumerated organic small molecules like GDB-17 contain more than 100 billion records. For practical purposes, are these useful? Will users know how to use them? Will InChIKey still work? Is interactivity a thing of the past? Are the possibilities so great that it all just seems random?

PubChem40 is one of the largest and most heavily used chemical information databases in the world. With over 250 million bioactivities, 250 million substances, 100 million compounds, and 1 million bioassays, PubChem helps researchers make sense of the biological roles of chemicals and their effects on human health and the environment. If PubChem had 1 billion structures, or 10 billion structures, would it be more useful, or less?

Many links among PubChem records exist. Many-to-many links abound within a given searchable collection. There are over 700 contributing resources, with many links to other external collections and annotations. Publishers such as Thieme and Springer Nature provide associations between chemicals and DOIs. Crossref,144 PubMed,145 and SciGraph146 provide document level metadata. Many precomputed relationships and analyses exist to further integration and interpretation by users and machines. The richness of data is further enhanced by machine-based semantic relationships. PubChemRDF40 describes a core subset of relationships as machine-readable information triples, and uses ontologies and standard vocabularies for description. If there were tens of trillions of triples, how would you download and use them? If there were 700 TB of PubChem data how would you download them and who would pay for a cloud version? Everything is analyzable or downloadable in pieces interactively, and programmatically.147-149 If PubChem were suddenly 10 times larger it would implode, as it does not scale to such a level with the current set of features. A complete rethink of PubChem would be needed.

Combinations of hardware, software, money, time, and use cases are what hold us back from “chemical infinity”. As database size increases, the number of connections to other records scales as the square of the number of records. The very basic operations of a chemical structure database are O(N), but sorting the results is typically O(N ln N). Can we find better ways to scale basic expected operations?

Can we implement more memory-efficient approaches? Evan considered the economics of a “featureless” chemical structure database in the cloud. A barebones structure searchable database of one million chemical structures would cost about $219 a year. This cost rises to $1,105 for 100 million, $110,460 for 10 billion, and over $110 million for 10 trillion. These are figures for storage in memory, and capacity for only one query at a time. A full-featured chemical search system would likely be 10-100 times more expensive.
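The quoted costs can be normalized to a cost per million structures to see how they scale. The sketch below uses only the figures given in the text; treating "over $110 million" as exactly $110 million (a lower bound) is an assumption, as is the interpretation of the result.

```python
# Annual cost in USD of a barebones, in-memory, one-query-at-a-time database,
# keyed by number of structures (figures quoted in the text).
COST_USD_PER_YEAR = {
    1_000_000: 219,
    100_000_000: 1_105,
    10_000_000_000: 110_460,
    10_000_000_000_000: 110_000_000,  # "over $110 million": taken as a lower bound
}

# Cost per million structures per year at each database size.
cost_per_million = {
    n_structures: cost / (n_structures / 1_000_000)
    for n_structures, cost in COST_USD_PER_YEAR.items()
}

for n_structures, unit_cost in cost_per_million.items():
    print(f"{n_structures:>18,} structures: ${unit_cost:,.2f} per million per year")

# The text estimates a full-featured system at 10-100 times the barebones cost.
full_featured_range = {
    n_structures: (10 * cost, 100 * cost)
    for n_structures, cost in COST_USD_PER_YEAR.items()
}
```

On these numbers the unit cost falls from $219 per million structures at the smallest size to about $11 per million from 100 million structures upward, after which the total grows roughly linearly with the number of structures. This suggests that a fixed overhead dominates small databases, although the talk summary does not break the costs down.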

There are other practical considerations beyond cost. How does the workflow change for users? What can a user do with a large selection of results? Does one save a list of 1.5 billion structures to come back later and analyze more? Will speed be sufficient when store versus compute-on-the-fly becomes a serious consideration? What decision-making analysis will be useful to users?

How will users import compounds, substances, bioassays, genes, proteins, pathways, literature, and patents, and upload lists of IDs, draw structures, etc., and how much will they be able to export? How will the PubChem team use large inputs on the database side? What can PubChem afford to compute? Every feature has a cost: which do you keep and what can users do with the results? Will users be allowed to filter and how will they do it? How will it affect performance? How will it scale? Will dynamic analysis and histograms be affordable?

Will chemical similarity have meaning in an ultralarge database? Can PubChem afford to compute (and maybe store) a fingerprint? Dare they try 3D similarity and can they afford the GPUs? Will PubChem3D still be feasible? What about analysis and subsetting? Should PubChem provide useful tools (such as the classification browser) that help the user make sense of millions or billions of results? There are many other helpful services and functions. Will specialized services be possible in an “ultralarge PubChem”? Will they allow bulk download of selected sets; annotation and comparison; subsetting and selection; and various analyses? Will users understand? How will “biologics” be handled (including glycans, amino acids, and nucleic acid monomers)?

New approaches are needed with ultralarge databases. These include compact formats that resist enumeration; unprecedented computational efficiency; rethinking the workflow (machines and scripts versus humans); and cost efficiency. Not every database has to have all the features found in PubChem but some key features will be needed if a database is to be useful and relevant to end users. Substantial investment in developing new algorithms, software, and rethinking databases will be necessary for practical utilization of “ultralarge PubChem”.

A collaborative database for chemistry in Google BigQuery Ian Wetherbee, Google Patents and Stephen Boyer, Collabra on behalf of Google, USA

Google Patents150 is a worldwide patents corpus, available under a CC BY 4.0 license, consisting of Google Patents Public Data by IFI CLAIMS Patents Services151 and Google, and Google Patents Research Data by Google, based on data provided by IFI CLAIMS and OntoChem (see the talk by Lutz Weber).

As data get larger, and computation scales with disk space, it becomes expensive to store data that are rarely accessed. A data supplier needs to pay for everyone’s queries and data and to find ways for the users to pay their share. As more users use the data, they all need to go through the same extract-transform-load process. Raw data are not usable data. In Google BigQuery the bulk raw data from multiple databases are loaded, managed, updated, and linked, and are ready to be analyzed. This is efficient: patent users do not perform duplicate tasks, nor do they need to know how to make a PostgreSQL database or learn how to use SQL and Python.

No server setup is needed. This is a relational database in the cloud, linking public, private, and corporate tables. Replicated, distributed storage and the compute cluster are decoupled on the petabyte network for maximum flexibility. Dozens of life science data tables are currently in use (including ChEMBL, SAVI, and Google Patents). Multiple applications are integrated with BigQuery to analyze data, for example, Tableau,152 KNIME, Qlik products,153 SciWalker (see Lutz Weber’s talk), Looker,154 Kaggle,155 Google Data Studio,156 and APIs for Python, R and Java.

Google has made BigQuery chemistry-aware for handling compound databases of a billion compounds. Google curates by computer the full text of documents for chemical entities. Once identified, the chemical entities are then converted into SMILES and InChIKeys which render the documents searchable by structure or substructure, or by InChIKeys. Similarity search is also possible. Ian illustrated this with a Morgan fingerprint search in SAVI. The cost of one-time SMILES indexing (using RDKit)41 is about $10; storage of two terabytes of data costs $20 a month. A single-compound similarity search of SAVI costs about $1. Rule of Five157 or patent data could be joined in.

A workflow enabling ultralarge virtual screens has been reported.10 VirtualFlow is a highly automated, open-source platform with perfect scaling behavior that is able to prepare and screen ultralarge libraries of compounds efficiently, using a variety of docking programs. It was used to prepare freely available, ready-to-dock ligand libraries, with more than 1.4 billion commercially available molecules and to identify a set of structurally diverse molecules that bind to KEAP1 with submicromolar affinity.

Chemical substructure search in ultralarge chemical databases: fast virtual screening with rapid isostere discovery engine (RIDE) Eugene Raush, Molsoft, San Diego, CA, USA

Raush presented two new methods for mining ultralarge chemical databases. The first method, called Giga-Search,158 implements an extremely fast chemical substructure search algorithm. The traditional approach to substructure or similarity search using fingerprints may take minutes to hours to get results from an ultralarge database. The bottleneck is the amount of fingerprint data to scan through. The new approach is to find an optimal subset of bits dynamically for an input chemical query and design new fingerprint storage to read only a subset of the bits for efficiency.

Significant bits can be found using precalculated fingerprint statistics for a certain database. This information can further be used to filter the database more efficiently without scanning the whole table of fingerprints. Molsoft found that for the average chemical substructure, a set of about 10 significant bits is enough to reduce the chemical search space significantly, meaning that less than 1% of the fingerprint data needs to be read. In the prefiltering stage, an optimal bit subset for a given pattern is found and used to reduce chemical space (the list of row IDs). This stage takes around 2 seconds for any pattern in a database of 1 billion structures. The reduced space is usually less than 0.5% of the total. Next, a “traditional” search is carried out in the reduced chemical space to find the actual hits.
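The two-stage idea (prefilter on a few significant bits, then a traditional search on what remains) can be sketched in a few lines of Python. This is an illustration under stated assumptions and not Molsoft's implementation: the per-bit inverted index, the function names, the toy fingerprints, and the choice of "rarest bits in the database" as the significance criterion are all hypothetical, and a full fingerprint subset test stands in for the real atom-by-atom substructure match.

```python
from collections import defaultdict


def build_bit_index(fingerprints):
    """Map each bit position to the set of row ids whose fingerprint has it set."""
    index = defaultdict(set)
    for row_id, bits in fingerprints.items():
        for bit in bits:
            index[bit].add(row_id)
    return index


def prefilter(query_bits, index, n_significant=10):
    """Intersect the row lists of only the rarest (most selective) query bits."""
    if not query_bits:
        raise ValueError("query fingerprint has no bits set")
    # Rank query bits by how many database rows set them; ties broken by bit number.
    ranked = sorted(query_bits, key=lambda bit: (len(index.get(bit, ())), bit))
    candidates = None
    for bit in ranked[:n_significant]:
        rows = index.get(bit, set())
        candidates = set(rows) if candidates is None else candidates & rows
    return candidates


def search(query_bits, fingerprints, index, n_significant=10):
    """Prefilter, then verify each remaining candidate in full."""
    candidates = prefilter(query_bits, index, n_significant)
    # Stand-in for the traditional search: a full fingerprint subset test.
    return [row for row in sorted(candidates) if query_bits <= fingerprints[row]]


# Toy database: row id -> set of "on" fingerprint bits (hypothetical data).
FINGERPRINTS = {
    0: {1, 2, 3, 7},
    1: {1, 2},
    2: {1, 2, 3, 7, 9},
    3: {2, 3},
    4: {1, 3, 7},
    5: {7, 9},
    6: {1, 2, 9},
}
QUERY_BITS = {1, 3, 7}

index = build_bit_index(FINGERPRINTS)
candidates = prefilter(QUERY_BITS, index, n_significant=1)
hits = search(QUERY_BITS, FINGERPRINTS, index, n_significant=1)
print("candidates after reading one bit column:", sorted(candidates))
print("verified hits:", hits)
```

With a single significant bit the prefilter still lets one false positive through (row 3), which the verification stage removes; reading a second bit column would eliminate it before verification, at the price of more data read.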

Giga-Search was tested on Enamine REAL 2020 and SAVI 2020. The time to get the first 1000 hits for five fairly simple substructure searches varied between 3.66 and 4.21 seconds. The substructure search product is provided as an online service, with a GUI from ICM-Pro60 or ICM-Chemist,159 or, for Molcart160 users, as a service hosted locally on a custom database.

The second new Molsoft method is a 3D virtual ligand-based screening algorithm called Rapid Isostere Discovery Engine (RIDE)161 based on Atomic Property Fields (APFs).162 RIDE searches databases of compound conformers for molecules that are isosteric to the query, that is, have similar 3D configurations and distributions of atomic properties. APF is a 3D pharmacophoric potential implemented on a continuously distributed grid which can be used for ligand docking and scoring. APF covariance score is a measure of 3D similarity and alignment quality.

Tested against the Database of Useful Decoys, Enhanced (DUDe) benchmark,163 flexible APF outperforms other methods in accuracy of flexible ligand superposition but it is relatively slow. RIDE uses pregenerated conformers, and using rigid body APF superposition only, it optimizes translations and rotations. A new systematic search algorithm features high level heuristics focusing search on the most relevant space; maximized use of precomputed lookups; the most compact data storage; and concurrent evaluation of many poses. It is much faster than flexible APF, even on a CPU; GPU implementation is orders of magnitude faster. The alignment quality is comparable to that of flexible APF, as measured by the top 2% enrichment factor.

RIDE works with any set of pregenerated conformations and provides both sequential and random access (by ID). The database can be generated either by the Molsoft conformation generator or by converting multiple external SDfiles, but the Molsoft method is 20 times more space efficient than using SDfiles. Contributions of different portions of the molecule can be modulated with per-atom weights to reflect the relative importance of certain moieties. In the excluded volumes feature, an envelope penalty can be applied to the regions that surround all or part of the query molecule to prioritize hits without bulky extensions in constrained regions. Multiple ligands can be used as a consensus template.

GPU-based implementation is capable of searching more than 0.5 million conformers per second on a single GPU card and performs 3D virtual screens of millions of compounds with a level of interactivity comparable to 2D searches; 1 billion compounds can be screened in 8-9 hours. Molsoft provides a number of pregenerated conformational databases; a GUI from ICM-Pro or ICM-Chemist, and a Docker container with RIDE service and web front end.

Gigadocking: structure-based virtual screening of billions of molecules Mark McGann, OpenEye Scientific Software, Santa Fe, NM, USA

Orion,164 OpenEye’s cloud-based drug discovery platform, can provide massive computational power for large scale virtual screening. Using Orion, even relatively complex calculations such as docking can be run on billions of molecules in a single day. OEDocking165 is a suite of molecular docking tools and workflows, including FRED (fast exhaustive docking) for virtual screening. ROCS17 performs shape similarity for virtual screening and lead hopping; the GPU version is FastROCS.

AstraZeneca have shown that using Orion, generation of conformer sets for 10^10 molecules is feasible within 2-3 days and at a cost of about $20,000. Once generated, this resource can be used for an unlimited number of searches. Searching this huge number of molecules with FastROCS is feasible in minutes, for about $100 per query.166

In 2019, OpenEye screened HSP90 (using the cocrystal structure with an ID of 1UYG in the Protein Data Bank) against 1.4 billion molecules in Enamine REAL.4 FastROCS run time was about 45 minutes. The maximum number of GPUs used was 340. Eventually, 72 compounds were assayed. Run time for Gigadocking with FRED was 18 hours and the maximum number of CPUs used was about 45,000. This was equivalent to a total CPU time of 55 years. Eventually, 105 compounds were assayed. The IC50 of the ligand in 1UYG was 53 μM. The IC50 of the best of the 72 FastROCS compounds was 18 μM. The IC50 of the best of the 105 compounds assayed after FRED Gigadocking was 4 μM. This best compound was also the number one top scoring molecule of all the Enamine REAL molecules.
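The Gigadocking figures can be put on a per-molecule footing with some simple arithmetic. The sketch below assumes that the 55 CPU-years, the 18 hour run time, and the 1.4 billion molecules all refer to the same run, and it uses a 365.25 day year; neither assumption is stated in the talk summary.

```python
# Figures quoted in the text for the FRED Gigadocking run.
N_MOLECULES = 1.4e9
TOTAL_CPU_YEARS = 55
WALL_CLOCK_HOURS = 18
MAX_CPUS = 45_000

SECONDS_PER_YEAR = 365.25 * 24 * 3600   # assumption: Julian year

total_cpu_seconds = TOTAL_CPU_YEARS * SECONDS_PER_YEAR

# Average docking cost per molecule in CPU-seconds.
cpu_seconds_per_molecule = total_cpu_seconds / N_MOLECULES

# Average number of CPUs busy over the whole run, versus the quoted maximum.
average_cpus = total_cpu_seconds / (WALL_CLOCK_HOURS * 3600)
average_utilization = average_cpus / MAX_CPUS

print(f"CPU time per molecule: {cpu_seconds_per_molecule:.2f} s")
print(f"average CPUs in use:   {average_cpus:,.0f} of a maximum of {MAX_CPUS:,}")
print(f"average / maximum:     {average_utilization:.0%}")
```

On these assumptions each molecule takes a little over one CPU-second to dock, and the average number of CPUs in use is roughly 60% of the quoted maximum, which would be consistent with the cluster ramping up and down during the run.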

OpenEye also collaborated with Beacon Discovery to identify novel active compounds for known GPCR targets in Orion, through FastROCS and Gigadocking. Novel chemical series were identified by both ROCS and docking and were deemed to be worthy of follow-up.

The Orion security model is a foundational part of Orion’s design. It implements industry standard best practices; data are encrypted at rest and in transit; and third party penetration testing has been carried out.

Floe, a graphical workflow engine, makes it simple to construct scripts to take advantage of Orion’s offerings. Workflows consist of cubes written in Python. Each cube performs a simple task, and has input and output ports. Cubes can be easily converted from serial to parallel. A floe is a set of cubes that perform a (possibly) complex task, define connections, and handle versioning, and data provenance. Cubes run concurrently. The Gigadocking floe is a standard floe available to any Orion user. Inputs are the receptor and molecules to dock; the output is a database of top scoring docked poses. A floe is provided with Orion for preparing custom databases. Prepared databases can be filtered for druglikeness, and stereo enumeration and conformation generation can be carried out. Prepared datasets in Orion include Enamine REAL and the diverse version of REAL, Wuxi AppTec’s GalaXi,22 and Mcule ULTIMATE.36

Data storage for large scale Gigadocking and FastROCS is in collections, which are groups of files. Input and output collections are sharded and parallelized. Data are stored securely on Amazon S3. OpenEye had to overcome certain challenges. In collections there is a vast amount of I/O in parallel, and Gigadocking has many small shards. These challenges are complicated by the security model. The scheduler is responsible for allocation of resources (AWS Instances) in Orion. This is invisible to users, but is really the heart of Orion. Early beta versions of Orion made suboptimal resource allocations.

In 2021 OpenEye intends to address the Enamine REAL space. It is estimated that there will be about 25 billion molecules after filtering and stereo enumeration; the storage size of a prepared Gigadocking collection will be about 100 TB; the cost of a FastROCS screen will be about $300; and the cost of a Gigadocking screen will be approximately $250,000.

Conclusion A great deal of active research is ongoing in this field but many questions and challenges remain. Some concern the actual content of libraries. Is it diverse enough? A number of questions were asked about the preponderance of coupling reactions and a shortage of ring closures. There is limited consideration of multistep reactions. A multitude of galaxies is being built. Should there be an open universe? Studying the overlap of ultralarge libraries is not currently easy. Some users are also interested in checking for novelty and patentability. Once huge libraries get even larger (e.g., “ultralarge PubChem” and GDB-20) there are formidable challenges, many of them outlined above, but there is also the issue of how the eyes and the brain will be able to cope with visualization and meaningful data abstraction.

More generally, what about a hitchhiker’s guide to the chemical galaxy? If there are 10^60 stars we need a map and navigation system for them (e.g., the order of reaction routes or precomputing retrosynthetic routes), just as there is a way to surf the Internet today. Should we concentrate on infrastructure or should we concentrate on novel tools? It is likely that once the tools have been developed, an infrastructure can be put in place to support a new era of drug discovery in the age of huge datasets.

The last 25 years have been dominated by high throughput, ultrahigh throughput, and virtual screening. We may now be in a position to leave these technologies with their limitations behind us. Fragment-based drug discovery and make-on-demand compound libraries with the respective software enable us to search for bioactives rather than to screen for them. Ever better search tools are being developed but the challenge now lies in 3D computational chemistry applications.

Supply and quality of data are frequently mentioned. The shortage of negative data is a real challenge for machine learning applications. There is also a lack of standards to annotate such data adequately (i.e., to define what differentiates negative data in a chemical or biological context). Addressing the problem of getting information in and out of ELNs might help. We should not continue to store information that is not immediately minable, particularly in the context of use for machine learning algorithms aimed at facilitating exploration of chemical space. Predicting reaction conditions with machine learning is difficult because there is no controlled vocabulary. Predicting yields is another challenge. Can we make a model connected to the feasibility score for making a molecule? Is there any open science initiative around this?

Some of the talks mentioned workflows giving access to synthetically accessible molecules in virtual libraries. At the end of the workflow, a molecule must be made: if you want a sample to evaluate and evolve models and hypotheses you have to think about carrying out a reaction. Hence there is an interest in reaction informatics and automation. Chemical synthesis coupled with automation can help augment knowledge acquisition. There has been increasing interest of late in enhancing reaction informatics with chemical synthesis automation. Drug discovery can be accelerated with automated platforms, leading to new chemical knowledge. This will unload routine work from human chemists who will be able to concentrate on innovation.

How do we program synthesis robots if there is a different language from each vendor? We need standard formats to link synthetic routes to robots. Who will generate the standards we need? Who might develop interconversion software? Can we improve our transforms with information from the robots? How do you define and describe a reaction? Do we need new ways of representing and classifying reactions, apart from better open ontologies and enhancements to reaction representations already under development? The interest in reaction informatics and automation merits a follow-up virtual conference, planned for May 2021.

Acknowledgments I am grateful to the organizers for all their support in helping me to prepare this report, and especially to Marc Nicklaus and Janelle Cortner of NIH. Every speaker was invited to check and correct the text of his or her individual presentation. I am grateful to all the speakers for their helpfulness and cooperation. In order to keep the report to a reasonable length I have limited the number of figures, and I tried, by agreement with the organizers, to keep each summary to fewer than two pages. Most of the presentations are currently available on the NIH website.167

References
(1) Hoffmann, T.; Gastreich, M. The next level in chemical space navigation: going far beyond enumerable compound libraries. Drug Discovery Today 2019, 24 (5), 1148-1156.
(2) Wilkinson, M. D.; Dumontier, M.; Aalbersberg, I. J. J.; Appleton, G.; Axton, M.; Baak, A.; Blomberg, N.; Boiten, J.-W.; da Silva Santos, L. B.; Bourne, P. E.; Bouwman, J.; Brookes, A. J.; Clark, T.; Crosas, M.; Dillo, I.; Dumon, O.; Edmunds, S.; Evelo, C. T.; Finkers, R.; Gonzalez-Beltran, A.; Gray, A. J. G.; Groth, P.; Goble, C.; Grethe, J. S.; Heringa, J.; 't Hoen, P. A. C.; Hooft, R.; Kuhn, T.; Kok, R.; Kok, J.; Lusher, S. J.; Martone, M. E.; Mons, A.; Packer, A. L.; Persson, B.; Rocca-Serra, P.; Roos, M.; van Schaik, R.; Sansone, S.-A.; Schultes, E.; Sengstag, T.; Slater, T.; Strawn, G.; Swertz, M. A.; Thompson, M.; van der Lei, J.; van Mulligen, E.; Velterop, J.; Waagmeester, A.; Wittenburg, P.; Wolstencroft, K.; Zhao, J.; Mons, B. The FAIR Guiding Principles for scientific data management and stewardship. Sci. Data 2016, 3, 160018.
(3) Enamine. http://enamine.net/ (accessed March 3, 2021).
(4) Enamine REAL. http://enamine.net/library-synthesis/real-compounds/real-database (accessed March 5, 2021).
(5) Chemspace. http://chem-space.com/ (accessed March 3, 2021).
(6) Arthor from NextMove Software. http://www.nextmovesoftware.com/arthor.html (accessed March 8, 2020).
(7) ChemAxon MadFast. http://chemaxon.com/products/madfast (accessed March 3, 2021).
(8) Lyu, J.; Wang, S.; Balius, T. E.; Singh, I.; Levit, A.; Moroz, Y. S.; O'Meara, M. J.; Che, T.; Algaa, E.; Tolmachova, K.; Tolmachev, A. A.; Shoichet, B. K.; Roth, B. L.; Irwin, J. J. Ultra-large library docking for discovering new chemotypes. Nature 2019, 566 (7743), 224-229.
(9) Stein, R. M.; Kang, H. J.; McCorvy, J. D.; Glatfelter, G. C.; Jones, A. J.; Che, T.; Slocum, S.; Huang, X.-P.; Savych, O.; Moroz, Y. S.; Stauch, B.; Johansson, L. C.; Cherezov, V.; Kenakin, T.; Irwin, J. J.; Shoichet, B. K.; Roth, B. L.; Dubocovich, M. L. Virtual discovery of melatonin receptor ligands to modulate circadian rhythms. Nature 2020, 579 (7800), 609-614.
(10) Gorgulla, C.; Boeszoermenyi, A.; Wang, Z.-F.; Fischer, P. D.; Coote, P. W.; Padmanabha Das, K. M.; Malets, Y. S.; Radchenko, D. S.; Moroz, Y. S.; Scott, D. A.; Fackeldey, K.; Hoffmann, M.; Iavniuk, I.; Wagner, G.; Arthanari, H. An open-source drug discovery platform enables ultra-large virtual screens. Nature 2020, 580 (7805), 663-668.
(11) BioSolveIT FTrees. http://www.biosolveit.de/infiniSee/#FTrees (accessed March 3, 2021).
(12) Klingler, F.-M.; Gastreich, M.; Grygorenko, O. O.; Savych, O.; Borysko, P.; Griniukova, A.; Gubina, K. E.; Lemmen, C.; Moroz, Y. S. SAR by space: enriching hit sets from the chemical space. Molecules 2019, 24 (17), 3096.
(13) BioSolveIT infiniSee. https://www.biosolveit.de/infiniSee/ (accessed March 3, 2021).
(14) Campbell, I. B.; Macdonald, S. J. F.; Procopiou, P. A. Medicinal chemistry in drug discovery in big pharma: past, present and future. Drug Discovery Today 2018, 23 (2), 219-234.

(15) Kraut, H.; Eiblmaier, J.; Grethe, G.; Loew, P.; Matuszczyk, H.; Saller, H. Algorithm for reaction classification. J. Chem. Inf. Model. 2013, 53 (11), 2884-2895.
(16) Wang, L.; Wu, Y.; Deng, Y.; Kim, B.; Pierce, L.; Krilov, G.; Lupyan, D.; Robinson, S.; Dahlgren, M. K.; Greenwood, J.; Romero, D. L.; Masse, C.; Knight, J. L.; Steinbrecher, T.; Beuming, T.; Damm, W.; Harder, E.; Sherman, W.; Brewer, M.; Wester, R.; Murcko, M.; Frye, L.; Farid, R.; Lin, T.; Mobley, D. L.; Jorgensen, W. L.; Berne, B. J.; Friesner, R. A.; Abel, R. Accurate and Reliable Prediction of Relative Ligand Binding Potency in Prospective Drug Discovery by Way of a Modern Free-Energy Calculation Protocol and Force Field. J. Am. Chem. Soc. 2015, 137 (7), 2695-2703.
(17) OpenEye Scientific Software's ROCS. http://www.eyesopen.com/rocs (accessed March 5, 2021).
(18) Schindler, C. E. M.; Baumann, H.; Blum, A.; Böse, D.; Buchstaller, H.-P.; Burgdorf, L.; Cappel, D.; Chekler, E.; Czodrowski, P.; Dorsch, D.; Eguida, M. K. I.; Follows, B.; Fuchß, T.; Grädler, U.; Gunera, J.; Johnson, T.; Jorand Lebrun, C.; Karra, S.; Klein, M.; Knehans, T.; Koetzner, L.; Krier, M.; Leiendecker, M.; Leuthner, B.; Li, L.; Mochalkin, I.; Musil, D.; Neagu, C.; Rippmann, F.; Schiemann, K.; Schulz, R.; Steinbrecher, T.; Tanzer, E.-M.; Unzue Lopez, A.; Viacava Follis, A.; Wegener, A.; Kuhn, D. Large-Scale Assessment of Binding Free Energy Calculations in Active Drug Discovery Projects. J. Chem. Inf. Model. 2020, 60 (11), 5457-5474.
(19) BioSolveIT CoLibri. http://www.biosolveit.de/products/#CoLibri (accessed March 5, 2021).
(20) ZINC database. http://zinc.docking.org/ (accessed March 12, 2021).
(21) Schrödinger Glide docking. http://www.schrodinger.com/products/glide (accessed March 5, 2021).
(22) GalaXi. http://wxpress.wuxiapptec.com/wuxi-apptec-research-service-division-and-biosolveit-introduce-galaxi-a-vast-new-chemical-space-of-tangible-molecules/ (accessed March 5, 2021).
(23) BioSolveIT Knowledge Space. http://www.biosolveit.de/infiniSee/?file=KnowledgeSpace_2018-06.space#chemical_spaces (accessed March 5, 2021).
(24) Nicolaou, C. A.; Brown, N. Multi-objective optimization methods in drug design. Drug Discovery Today: Technol. 2013, 10 (3), e427-e435.
(25) Godfrey, A. G.; Masquelin, T.; Hemmerle, H. A remote-controlled adaptive medchem lab: an innovative approach to enable drug discovery in the 21st Century. Drug Discovery Today 2013, 18 (17-18), 795-802.
(26) Nicolaou, C. A.; Watson, I. A.; Hu, H.; Wang, J. The Proximal Lilly Collection: Mapping, Exploring and Exploiting Feasible Chemical Space. J. Chem. Inf. Model. 2016, 56 (7), 1253-1266.
(27) Bruns, R. F.; Watson, I. A. Rules for Identifying Potentially Reactive or Promiscuous Compounds. J. Med. Chem. 2012, 55 (22), 9763-9772.
(28) NameRxn software from NextMove Software. http://www.nextmovesoftware.com/namerxn.html (accessed April 5, 2020).
(29) Nicolaou, C. A.; Watson, I. A.; LeMasters, M.; Masquelin, T.; Wang, J. Context Aware Data-Driven Retrosynthetic Analysis. J. Chem. Inf. Model. 2020, 60 (6), 2728-2738.
(30) Chemical Computing Group. Molecular Operating Environment (MOE). http://www.chemcomp.com/Products.htm (accessed March 6, 2021).
(31) Nicolaou, C. A.; Humblet, C.; Hu, H.; Martin, E. M.; Dorsey, F. C.; Castle, T. M.; Burton, K. I.; Hu, H.; Hendle, J.; Hickey, M. J.; Duerksen, J.; Wang, J.; Erickson, J. A. Idea2Data: toward a new paradigm for drug discovery. ACS Med. Chem. Lett. 2019, 10 (3), 278-286.
(32) Boehm, M.; Wu, T.-Y.; Claussen, H.; Lemmen, C. Similarity searching and scaffold hopping in synthetically accessible combinatorial chemistry spaces. J. Med. Chem. 2008, 51 (8), 2468-2480.
(33) Wallach, I.; Dzamba, M.; Heifets, A. S. AtomNet: A Deep Convolutional Neural Network for Bioactivity Prediction in Structure-based Drug Discovery. http://arxiv.org/pdf/1510.02855.pdf (accessed March 6, 2021).

(34) Heifets, A. S.; Wallach, I.; Dzamba, M. Systems and methods for applying a convolutional network to spatial data. US9373059, 2016.
(35) Rogers, D.; Hahn, M. Extended-Connectivity Fingerprints. J. Chem. Inf. Model. 2010, 50 (5), 742-754.
(36) Mcule ULTIMATE. http://ultimate.mcule.com/ (accessed March 15, 2021).
(37) Moriwaki, H.; Tian, Y.-S.; Kawashita, N.; Takagi, T. Mordred: a molecular descriptor calculator. J. Cheminf. 2018, 10, 4.
(38) Clyde, A.; Duan, X.; Stevens, R. Regression enrichment surfaces: a simple analysis technique for virtual drug screening models. http://arxiv.org/pdf/2006.01171.pdf (accessed March 8, 2021).
(39) Babuji, Y.; Blaiszik, B.; Brettin, T.; Chard, K.; Chard, R.; Clyde, A.; Foster, I.; Hong, Z.; Jha, S.; Li, Z.; Liu, X.; Ramanathan, A.; Ren, Y.; Saint, N.; Schwarting, M.; Stevens, R.; van Dam, H.; Wagner, R. Targeting SARS-CoV-2 with AI- and HPC-enabled Lead Generation: a First Data Release. http://arxiv.org/pdf/2006.02431.pdf (accessed March 8, 2021).
(40) Kim, S.; Chen, J.; Cheng, T.; Gindulyte, A.; He, J.; He, S.; Li, Q.; Shoemaker, B. A.; Thiessen, P. A.; Yu, B.; Zaslavsky, L.; Zhang, J.; Bolton, E. E. PubChem 2019 update: improved access to chemical data. Nucleic Acids Res. 2019, 47 (D1), D1102-D1109.
(41) RDKit: open-source cheminformatics. http://www.rdkit.org (accessed March 30, 2020).
(42) Al Saadi, A.; Alfe, D.; Babuji, Y.; Bhati, A.; Blaiszik, B.; Brettin, T.; Chard, K.; Chard, R.; Coveney, P.; Trifan, A.; Brace, A.; Clyde, A.; Foster, I.; Gibbs, T.; Jha, S.; Keipert, K.; Kurth, T.; Kranzlmüller, D.; Lee, H.; Li, Z.; Ma, H.; Merzky, A.; Mathias, G.; Partin, A.; Yin, J.; Ramanathan, A.; Shah, A.; Stern, A.; Stevens, R.; Tan, L.; Titov, M.; Tsaris, A.; Turilli, M.; van Dam, H.; Wan, S.; Wifling, D. IMPECCABLE: Integrated Modeling PipelinE for COVID Cure by Assessing Better LEads. http://arxiv.org/pdf/2010.06574.pdf (accessed March 8, 2021).
(43) Autodock-GPU. http://github.com/ccsb-scripps/AutoDock-GPU (accessed March 8, 2021).
(44) DeepDriveMD. http://deepdrivemd.github.io/ (accessed March 8, 2021).
(45) National Virtual Biotechnology Laboratory (NVBL). http://science.osti.gov/nvbl (accessed March 8, 2021).
(46) Poroikov, V. V. Computer-Aided Drug Design: from Discovery of Novel Pharmaceutical Agents to Systems Pharmacology. Biochemistry (Moscow), Suppl. Ser. B: Biomed. Chem. 2020, 14 (3), 216-227.
(47) Filimonov, D.; Poroikov, V. Probabilistic Approaches in Activity Prediction. In Chemoinformatics Approaches to Virtual Screening; Varnek, A., Tropsha, A., Eds.; The Royal Society of Chemistry: Cambridge, UK, 2008; pp 182-216.
(48) Filimonov, D. A.; Zakharov, A. V.; Lagunin, A. A.; Poroikov, V. V. QNA-based 'Star Track' QSAR approach. SAR QSAR Environ. Res. 2009, 20 (7-8), 679-709.
(49) Rudik, A. V.; Dmitriev, A. V.; Lagunin, A. A.; Filimonov, D. A.; Poroikov, V. V. Metabolism Site Prediction Based on Xenobiotic Structural Formulas and PASS Prediction Algorithm. J. Chem. Inf. Model. 2014, 54 (2), 498-507.
(50) Way2Drug and PASS. http://way2drug.com/ (accessed March 9, 2021).
(51) Poroikov, V. V.; Filimonov, D. A.; Gloriozova, T. A.; Lagunin, A. A.; Druzhilovskiy, D. S.; Rudik, A. V.; Stolbov, L. A.; Dmitriev, A. V.; Tarasova, O. A.; Ivanov, S. M.; Pogodin, P. V. Computer-aided prediction of biological activity spectra for organic compounds: the possibilities and limitations. Russ. Chem. Bull. 2019, 68 (12), 2143-2154.
(52) Murtazalieva, K. A.; Druzhilovskiy, D. S.; Goel, R. K.; Sastry, G. N.; Poroikov, V. V. How good are publicly available web services that predict bioactivity profiles for drug repurposing? SAR QSAR Environ. Res. 2017, 28 (10), 843-862.
(53) Zakharov, A. V.; Peach, M. L.; Sitzmann, M.; Nicklaus, M. C. A new approach to radial basis function approximation and its application to QSAR. J. Chem. Inf. Model. 2014, 54 (3), 713-719.
(54) Lagunin, A. A.; Romanova, M. A.; Zadorozhny, A. D.; Kurilenko, N. S.; Shilov, B. V.; Pogodin, P. V.; Ivanov, S. M.; Filimonov, D. A.; Poroikov, V. V. Comparison of quantitative and qualitative (Q)SAR models created for the prediction of Ki and IC50 values of antitarget inhibitors. Front. Pharmacol. 2018, 9, 1136.
(55) Stolbov, L. A.; Druzhilovskiy, D. S.; Filimonov, D. A.; Nicklaus, M. C.; Poroikov, V. V. (Q)SAR models of HIV-1 protein inhibition by drug-like compounds. Molecules 2020, 25 (1), 87.
(56) Druzhilovskiy, D. S.; Stolbov, L. A.; Savosina, P. I.; Pogodin, P. V.; Filimonov, D. A.; Veselovsky, A. V.; Stefanisko, K.; Tarasova, N. I.; Nicklaus, M. C.; Poroikov, V. V. Computational approaches to identify a hidden pharmacological potential in large chemical libraries. Supercomputing Frontiers and Innovations 2020, 7 (3), 57-76.
(57) Clarivate Cortellis. http://clarivate.com/cortellis/solutions/pre-clinical-intelligence-analytics/ (accessed March 10, 2021).
(58) Gaulton, A.; Hersey, A.; Nowotka, M.; Bento, A. P.; Chambers, J.; Mendez, D.; Mutowo, P.; Atkinson, F.; Bellis, L. J.; Cibrián-Uhalte, E.; Davies, M.; Dedman, N.; Karlsson, A.; Magariños, M. P.; Overington, J. P.; Papadatos, G.; Smit, I.; Leach, A. R. The ChEMBL database in 2017. Nucleic Acids Res. 2016, 45 (D1), D945-D954.
(59) Synthetically Accessible Virtual Inventory (SAVI). http://cactus.nci.nih.gov/download/savi_download/ (accessed March 10, 2021).
(60) MolSoft ICM-Pro. http://www.molsoft.com/icm_pro.html (accessed March 10, 2021).
(61) Joint European Disruptive Initiative (JEDI). http://www.jedi.foundation/ (accessed March 10, 2021).
(62) Fink, T.; Bruggesser, H.; Reymond, J.-L. Virtual exploration of the small-molecule chemical universe below 160 D. Angew. Chem., Int. Ed. 2005, 44 (10), 1504-1508.
(63) Blum, L. C.; Reymond, J.-L. 970 Million Druglike Small Molecules for Virtual Screening in the Chemical Universe Database GDB-13. J. Am. Chem. Soc. 2009, 131 (25), 8732-8733.
(64) Ruddigkeit, L.; van Deursen, R.; Blum, L. C.; Reymond, J.-L. Enumeration of 166 Billion Organic Small Molecules in the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2012, 52 (11), 2864-2875.
(65) Visini, R.; Awale, M.; Reymond, J.-L. Fragment Database FDB-17. J. Chem. Inf. Model. 2017, 57 (4), 700-709.
(66) Visini, R.; Arús-Pous, J.; Awale, M.; Reymond, J.-L. Virtual Exploration of the Ring Systems Chemical Universe. J. Chem. Inf. Model. 2017, 57 (11), 2707-2718.
(67) Awale, M.; Sirockin, F.; Stiefl, N.; Reymond, J.-L. Medicinal Chemistry Aware Database GDBMedChem. Mol. Inf. 2019, 38 (8-9), 1900031.
(68) Buhlmann, S.; Reymond, J.-L. ChEMBL-Likeness Score and Database GDBChEMBL. Front. Chem. 2020, 8, 46.
(69) Arús-Pous, J.; Blaschke, T.; Ulander, S.; Reymond, J.-L.; Chen, H.; Engkvist, O. Exploring the GDB-13 chemical space using deep generative models. J. Cheminf. 2019, 11, 20.
(70) Wishart, D. S.; Knox, C.; Guo, A. C.; Cheng, D.; Shrivastava, S.; Tzur, D.; Gautam, B.; Hassanali, M. DrugBank: a knowledgebase for drugs, drug actions and drug targets. Nucleic Acids Res. 2008, 36 (Database Issue), D901-D906.
(71) Awale, M.; Sirockin, F.; Stiefl, N.; Reymond, J.-L. Drug Analogs from Fragment-Based Long Short-Term Memory Generative Neural Networks. J. Chem. Inf. Model. 2019, 59 (4), 1347-1356.
(72) Awale, M.; van Deursen, R.; Reymond, J.-L. MQN-Mapplet: Visualization of Chemical Space with Interactive Maps of DrugBank, ChEMBL, PubChem, GDB-11, and GDB-13. J. Chem. Inf. Model. 2013, 53 (2), 509-518.
(73) Nguyen, K. T.; Blum, L. C.; van Deursen, R.; Reymond, J.-L. Classification of Organic Molecules by Molecular Quantum Numbers. ChemMedChem 2009, 4 (11), 1803-1805.
(74) van Deursen, R.; Blum, L. C.; Reymond, J.-L. A Searchable Map of PubChem. J. Chem. Inf. Model. 2010, 50 (11), 1924-1934.
(75) Reymond Research Group. http://gdb.unibe.ch (accessed March 12, 2021).
(76) Probst, D.; Reymond, J.-L. FUn: a framework for interactive visualizations of large, high-dimensional datasets on the web. Bioinformatics 2018, 34 (8), 1433-1435.
(77) Probst, D.; Reymond, J.-L. SmilesDrawer: Parsing and Drawing SMILES-Encoded Molecular Structures Using Client-Side JavaScript. J. Chem. Inf. Model. 2018, 58 (1), 1-7.
(78) Ruddigkeit, L.; Blum, L. C.; Reymond, J.-L. Visualization and Virtual Screening of the Chemical Universe Database GDB-17. J. Chem. Inf. Model. 2013, 53 (1), 56-65.
(79) Blum, L. C.; van Deursen, R.; Bertrand, S.; Mayer, M.; Burgi, J. J.; Bertrand, D.; Reymond, J.-L. Discovery of α7-Nicotinic Receptor Ligands by Virtual Screening of the Chemical Universe Database GDB-13. J. Chem. Inf. Model. 2011, 51 (12), 3105-3112.
(80) Probst, D.; Reymond, J.-L. A probabilistic molecular fingerprint for big data settings. J. Cheminf. 2018, 10, 66.
(81) Capecchi, A.; Probst, D.; Reymond, J.-L. One molecular fingerprint to rule them all: drugs, biomolecules, and the metabolome. J. Cheminf. 2020, 12 (1), 43.
(82) Probst, D.; Reymond, J.-L. Visualization of very large high-dimensional data sets as minimum spanning trees. J. Cheminf. 2020, 12, 12.
(83) GDBspace. http://gdbspace.com/home (accessed March 12, 2021).
(84) Luethi, E.; Nguyen, K. T.; Burzle, M.; Blum, L. C.; Suzuki, Y.; Hediger, M.; Reymond, J.-L. Identification of Selective Norbornane-Type Aspartate Analogue Inhibitors of the Glutamate Transporter 1 (GLT-1) from the Chemical Universe Generated Database (GDB). J. Med. Chem. 2010, 53 (19), 7236-7250.
(85) Delarue Bizzini, L.; Muntener, T.; Haussinger, D.; Neuburger, M.; Mayor, M. Synthesis of trinorbornane. Chem. Commun. (Cambridge, U. K.) 2017, 53 (83), 11399-11402.
(86) Meier, K.; Arus-Pous, J.; Reymond, J.-L. A Potent and Selective Janus Kinase Inhibitor with a Chiral 3D-Shaped Triquinazine Ring System from Chemical Space. Angew. Chem., Int. Ed. 2021, 60 (4), 2074-2077.
(87) Thakkar, A.; Selmi, N.; Reymond, J.-L.; Engkvist, O.; Bjerrum, E. J. "Ring Breaker": Neural Network Driven Synthesis Prediction of the Ring System Chemical Space. J. Med. Chem. 2020, 63 (16), 8791-8808.
(88) Thakkar, A.; Chadimová, V.; Bjerrum, E. J.; Engkvist, O.; Reymond, J.-L. Retrosynthetic accessibility score (RAscore) – rapid machine learned synthesizability classification from AI driven retrosynthetic planning. Chem. Sci. 2021, 12, 3339-3349.
(89) Rarey, M.; Stahl, M. Similarity searching in large combinatorial chemistry spaces. J. Comput.-Aided Mol. Des. 2001, 15 (6), 497-520.
(90) Bellmann, L.; Penner, P.; Rarey, M. Topological Similarity Search in Large Combinatorial Fragment Spaces. J. Chem. Inf. Model. 2021, 61 (1), 238-251.
(91) Penner, P.; Martiny, V.; Gohier, A.; Gastreich, M.; Ducrot, P.; Brown, D.; Rarey, M. Shape-Based Descriptors for Efficient Structure-Based Fragment Growing. J. Chem. Inf. Model. 2020, 60 (12), 6269-6281.
(92) Bellmann, L.; Penner, P.; Rarey, M. Connected Subgraph Fingerprints: Representing Molecules Using Exhaustive Subgraph Enumeration. J. Chem. Inf. Model. 2019, 59 (11), 4625-4635.
(93) Flachsenberg, F.; Meyder, A.; Sommer, K.; Penner, P.; Rarey, M. A Consistent Scheme for Gradient-Based Optimization of Protein–Ligand Poses. J. Chem. Inf. Model. 2020, 60 (12), 6502-6522.
(94) University of Hamburg ZBH - Center for Bioinformatics. http://uhh.de/naomi (accessed March 15, 2021).
(95) BioSolveIT. http://biosolveit.de (accessed March 15, 2021).
(96) Sterling, T.; Irwin, J. J. ZINC 15 - Ligand Discovery for Everyone. J. Chem. Inf. Model. 2015, 55 (11), 2324-2337.
(97) ZINC20. http://zinc20.docking.org/ (accessed March 12, 2021).
(98) NextMove Software's SmallWorld. http://www.nextmovesoftware.com/smallworld.html (accessed March 12, 2021).
(99) Irwin, J. J.; Tang, K. G.; Young, J.; Dandarchuluun, C.; Wong, B. R.; Khurelbaatar, M.; Moroz, Y. S.; Mayfield, J.; Sayle, R. A. ZINC20—A Free Ultralarge-Scale Chemical Database for Ligand Discovery. J. Chem. Inf. Model. 2020, 60 (12), 6065-6073.
(100) A Specialized Platform for Innovative Research Exploration (ASPIRE). http://ncats.nih.gov/aspire (accessed March 12, 2021).
(101) NextMove Software's HazELNut. http://www.nextmovesoftware.com/hazelnut.html (accessed March 12, 2021).
(102) Palantir. http://www.palantir.com (accessed March 18, 2021).
(103) Zahoranszky-Kohalmi, G.; Wan, K. K.; Godfrey, A. G. Hilbert-Curve Assisted Structure Embedding Method. http://chemrxiv.org/articles/preprint/Hilbert-Curve_Assisted_Structure_Embedding_Method/11911296/1 (accessed March 12, 2021).
(104) Ertl, P. Intuitive Ordering of Scaffolds and Scaffold Similarity Searching Using Scaffold Keys. J. Chem. Inf. Model. 2014, 54 (6), 1617-1622.
(105) Bemis, G. W.; Murcko, M. A. The Properties of Known Drugs. 1. Molecular Frameworks. J. Med. Chem. 1996, 39 (15), 2887-2893.
(106) Van Der Maaten, L. J. P.; Hinton, G. Visualizing High-Dimensional Data Using t-SNE. Journal of Machine Learning Research 2008, 9, 2579-2605.
(107) Illuminating the Druggable Genome. http://druggablegenome.net/ (accessed March 13, 2021).
(108) Sheils, T. K.; Mathias, S. L.; Kelleher, K. J.; Siramshetty, V. B.; Nguyen, D.-T.; Bologa, C. G.; Jensen, L. J.; Vidović, D.; Koleti, A.; Schürer, S. C.; Waller, A.; Yang, J. J.; Holmes, J.; Bocci, G.; Southall, N.; Dharkar, P.; Mathé, E.; Simeonov, A.; Oprea, T. I. TCRD and Pharos 2021: mining the human proteome for disease biology. Nucleic Acids Res. 2021, 49 (D1), D1334-D1346.
(109) Oprea, T. I.; Gottfries, J. Chemography: The Art of Navigating in Chemical Space. J. Comb. Chem. 2001, 3 (2), 157-166.
(110) VolSurf. http://www.moldiscovery.com/software/vsplus/ (accessed March 13, 2021).
(111) Oprea, T. I.; Zamora, I.; Ungell, A.-L. Pharmacokinetically Based Mapping Device for Chemical Space Navigation. J. Comb. Chem. 2002, 4 (4), 258-266.
(112) Rosen, J.; Gottfries, J.; Muresan, S.; Backlund, A.; Oprea, T. I. Novel Chemical Space Exploration via Natural Products. J. Med. Chem. 2009, 52 (7), 1953-1962.
(113) Oprea, T. I.; Benedetti, P.; Berellini, G.; Olah, M.; Fejgin, K.; Boyer, S. Rapid ADME Filters for Lead Discovery. In Molecular Interaction Fields: Applications in Drug Discovery and ADME Prediction, Volume 27; Cruciani, G., Ed.; Wiley-VCH: Weinheim, Germany, 2005; pp 249-272.
(114) iResearch Library. http://www.sigmaaldrich.com/chemistry/chemistry-services/chemnavigator.html (accessed March 13, 2021).
(115) Bologa, C.; Allu, T. K.; Olah, M.; Kappler, M. A.; Oprea, T. I. Descriptor collision and confusion: toward the design of descriptors to mask chemical structures. J. Comput.-Aided Mol. Des. 2005, 19 (9/10), 625-635.
(116) Pollock, S. N.; Coutsias, E. A.; Wester, M. J.; Oprea, T. I. Scaffold topologies. 1. Exhaustive enumeration up to eight rings. J. Chem. Inf. Model. 2008, 48 (7), 1304-1310.
(117) Wester, M. J.; Pollock, S. N.; Coutsias, E. A.; Allu, T. K.; Muresan, S.; Oprea, T. I. Scaffold Topologies. 2. Analysis of Chemical Databases. J. Chem. Inf. Model. 2008, 48 (7), 1311-1324.
(118) Chemistry42. http://insilico.com/chemistry42 (accessed March 13, 2021).
(119) Zhavoronkov, A.; Vanhaelen, Q.; Oprea, T. I. Will Artificial Intelligence for Drug Discovery Impact Clinical Pharmacology? Clin. Pharmacol. Ther. 2020, 107 (4), 780-785.
(120) Patel, H.; Ihlenfeldt, W.-D.; Judson, P. N.; Moroz, Y. S.; Pevzner, Y.; Peach, M. L.; Delannée, V.; Tarasova, N. I.; Nicklaus, M. C. SAVI, in silico generation of billions of easily synthesizable compounds through expert-system type rules. Sci. Data 2020, 7 (1), 384.
(121) Judson, P. N.; Ihlenfeldt, W.-D.; Patel, H.; Delannee, V.; Tarasova, N.; Nicklaus, M. C. Adapting CHMTRN (CHeMistry TRaNslator) for a New Use. J. Chem. Inf. Model. 2020, 60 (7), 3336-3341.
(122) CACTVS toolkit from Xemistry. http://xemistry.com/ (accessed March 13, 2021).
(123) SAVI-2020. http://cactus.nci.nih.gov/download/savi_download/ (accessed March 13, 2021).
(124) Delannee, V.; Nicklaus, M. C. SAVI à la carte: Moving toward molecules on demand by AI. The development of the SLICE (Smarts and Logic In ChEmistry) language. http://www.morressier.com/article/savi-la-carte-moving-toward-molecules-demand-ai-development-slice-smarts-logic-chemistry-language/5f511d216fdcfc687198a407 (accessed March 13, 2021).
(125) Advancing Therapeutic Opportunities for Medicine (ATOM) consortium. http://atomscience.org/ (accessed March 13, 2021).
(126) ATOM Modeling PipeLine (AMPL) for drug discovery. http://github.com/ATOMconsortium/AMPL (accessed March 13, 2021).
(127) Minnich, A. J.; McLoughlin, K.; Tse, M.; Deng, J.; Weber, A.; Murad, N.; Madej, B. D.; Ramsundar, B.; Rush, T.; Calad-Thomson, S.; Brase, J.; Allen, J. E. AMPL: A Data-Driven Modeling Pipeline for Drug Discovery. http://arxiv.org/pdf/1911.05211.pdf (accessed March 13, 2021).
(128) Jacobs, S. A.; Van Essen, B.; Hysom, D.; Yeom, J.-S.; Moon, T.; Anirudh, R.; Thiagarajan, J. J.; Liu, S.; Bremer, P.-T.; Gaffney, J.; Benson, T.; Robinson, P.; Peterson, L.; Spears, B. Parallelizing Training of Deep Generative Models on Massive Scientific Datasets. http://arxiv.org/pdf/1910.02270.pdf (accessed March 13, 2021).
(129) Dalke, A. The chemfp project. J. Cheminf. 2019, 11 (1), 76.
(130) Swamidass, S. J.; Baldi, P. Bounds and Algorithms for Fast Exact Searches of Chemical Fingerprints in Linear and Sublinear Time. J. Chem. Inf. Model. 2007, 47 (2), 302-317.
(131) The OpenMP API specification for parallel programming. http://www.openmp.org/resources/ (accessed March 14, 2021).
(132) gzip zlib. http://docs.python.org/3/library/zlib.html (accessed March 14, 2021).
(133) Zstandard. http://facebook.github.io/zstd/ (accessed March 14, 2021).
(134) Heap queue algorithm in Python. http://docs.python.org/3/library/heapq.html (accessed March 14, 2021).
(135) Diamond fragment screening: XChem. http://www.diamond.ac.uk/Instruments/Mx/Fragment-Screening.html# (accessed March 15, 2021).
(136) PostEra COVID Moonshot Chemical Space submission. http://covid.postera.ai/covid/submissions/8bf1eac9-97bc-474e-bd94-1df6e595278f (accessed March 15, 2021).
(137) Chemical Space Club. http://www.linkedin.com/groups/9004052/ (accessed March 15, 2021).
(138) SciWalker. http://ontochem.com/#section_semantic_solutions (accessed March 15, 2020).
(139) Digital Science Dimensions. http://www.digital-science.com/products/dimensions/ (accessed March 15, 2021).
(140) OpenChemLib. http://github.com/actelion/openchemlib (accessed March 15, 2021).
(141) Willighagen, E. L.; Mayfield, J. W.; Alvarsson, J.; Berg, A.; Carlsson, L.; Jeliazkova, N.; Kuhn, S.; Pluskal, T.; Rojas-Chertó, M.; Spjuth, O.; Torrance, G.; Evelo, C. T.; Guha, R.; Steinbeck, C. The Chemistry Development Kit (CDK) v2.0: atom typing, depiction, molecular formulas, and substructure searching. J. Cheminf. 2017, 9, 33.
(142) ChemAxon JChem web services. http://docs.chemaxon.com/display/docs/jchem-web-services-classic.md (accessed March 15, 2021).
(143) Golovin, A.; Henrick, K. Chemical Substructure Search in SQL. J. Chem. Inf. Model. 2009, 49 (1), 22-27.
(144) Crossref. http://www.crossref.org/ (accessed March 15, 2021).
(145) PubMed. http://pubmed.ncbi.nlm.nih.gov/ (accessed March 15, 2021).
(146) Springer Nature SciGraph. http://www.springernature.com/cn/researchers/scigraph (accessed March 15, 2021).
(147) Kim, S.; Thiessen, P. A.; Bolton, E. E.; Bryant, S. H. PUG-SOAP and PUG-REST: web services for programmatic access to chemical information in PubChem. Nucleic Acids Res. 2015, 43 (W1), W605-W611.
(148) Kim, S.; Thiessen, P. A.; Cheng, T.; Yu, B.; Bolton, E. E. An update on PUG-REST: RESTful interface for programmatic access to PubChem. Nucleic Acids Res. 2018, 46 (W1), W563-W570.
(149) Kim, S.; Thiessen, P. A.; Cheng, T.; Zhang, J.; Gindulyte, A.; Bolton, E. E. PUG-View: programmatic access to chemical annotations integrated in PubChem. J. Cheminf. 2019, 11 (1), 56.
(150) Google Patents. http://patents.google.com/ (accessed March 16, 2021).
(151) IFI Claims. http://www.ificlaims.com/ (accessed March 16, 2021).
(152) Tableau. http://www.tableau.com/en-gb (accessed March 16, 2021).
(153) Qlik. http://www.qlik.com/us/products (accessed March 16, 2021).
(154) Looker. http://looker.com/ (accessed March 16, 2021).
(155) Kaggle. http://www.kaggle.com/ (accessed March 16, 2021).
(156) Google Data Studio. http://datastudio.google.com/overview (accessed March 31, 2021).
(157) Lipinski, C. A.; Lombardo, F.; Dominy, B. W.; Feeney, P. J. Experimental and computational approaches to estimate solubility and permeability in drug discovery and development settings. Adv. Drug Delivery Rev. 2001, 46 (1-3), 3-26.
(158) MolSoft Giga-Search. http://www.molsoft.com/giga-search.html (accessed March 16, 2021).
(159) MolSoft ICM-Chemist. http://www.molsoft.com/icm-chemist.html (accessed March 16, 2021).
(160) MolSoft Molcart. http://molsoft.com/molcart.html (accessed March 16, 2021).
(161) MolSoft Rapid Isostere Discovery Engine (RIDE). http://molsoft.com/RIDE.html (accessed March 16, 2021).
(162) Totrov, M. Atomic property fields: generalized 3D pharmacophoric potential for automated ligand superposition, pharmacophore elucidation and 3D QSAR. Chem. Biol. Drug Des. 2008, 71 (1), 15-27.
(163) Mysinger, M. M.; Carchia, M.; Irwin, J. J.; Shoichet, B. K. Directory of Useful Decoys, Enhanced (DUD-E): Better Ligands and Decoys for Better Benchmarking. J. Med. Chem. 2012, 55 (14), 6582-6594.
(164) OpenEye Scientific Software's Orion. http://www.eyesopen.com/orion (accessed March 17, 2021).
(165) OpenEye Scientific Software's OEDocking. http://www.eyesopen.com/oedocking (accessed March 17, 2021).
(166) Grebner, C.; Malmerberg, E.; Shewmaker, A.; Batista, J.; Nicholls, A.; Sadowski, J. Virtual Screening in the Cloud: How Big Is Big Enough? J. Chem. Inf. Model. 2020, 60 (9), 4274-4282.
(167) NIH Virtual Workshop on Ultralarge Chemistry Databases. http://cactus.nci.nih.gov/presentations/NIHBigDB_2020-12/NIHBigDB.html (accessed March 30, 2021).