Report on an NIH Workshop on Ultralarge Chemistry Databases

Wendy A. Warr
Wendy Warr & Associates, 6 Berwick Court, Holmes Chapel, Cheshire, CW4 7HZ, United Kingdom. Email: [email protected]

Introduction

The virtual workshop took place on December 1-3, 2020. It was aimed at researchers, groups, and companies that generate, manage, sell, search, and screen databases of more than one billion small molecules (Figure 1). There were about 550 “attendees” from 37 different countries.

Recent advances in computational chemistry have enabled researchers to navigate virtual chemical spaces containing billions of chemical structures, carrying out similarity searches, studying structure-activity relationships (SAR), experimenting with scaffold-hopping, and using other drug discovery methodologies.1 For clarity, one could differentiate “spaces” from “libraries”, and “libraries” from “databases”. Spaces are combinatorially constructed collections of compounds; they are usually very large, and it is not possible to enumerate all the precise chemical structures that they cover. Libraries are enumerated collections of full structures, usually fewer than 10^10 molecules. Databases are a way of storing libraries, for example, in a relational database management system.

Figure 1. Ultralarge chemical databases. (Source: Marcus Gastreich based on the publication by Hoffmann and Gastreich.)

This report summarizes talks from about 30 practitioners in the field of ultralarge collections of molecules. The aim is to represent as accurately as possible the information that was delivered by the speakers; the report does not seek to be evaluative.

Welcoming remarks; defining a drug discovery gateway
Susan Gregurick, Office of Data Science and Strategy, NIH, USA

Data should be “findable, accessible, interoperable and reusable” (FAIR)2 and with this in mind, NIH has been creating, curating, integrating, and querying ultralarge chemistry databases.
Ultimately, though, the aim is to compute on data and information in order to find better, targeted therapeutics. The community already has industrial databases of building blocks, fragments, screening compounds, reagents, intermediates, and synthetic routes. We have algorithms to measure affinity and predict protein binding, as well as healthcare records and data on clinical trials. We will be able to collaborate and build platforms based on all this information. To develop these networks, we need large-scale, computable metadata schema; persistent identifiers; 2D and 3D knowledge graphs and AI; and an ecosystem of high performance computing (HPC) and cloud computing enclaves. There has been a great deal of progress, but there is more to do to refine the drug discovery gateway.

Making virtual REAL: an approach to access billions of make-on-demand compounds
Yurii Moroz, Chemspace LLC, Kyiv, Ukraine

Chemical space is vast: estimates are around 10^63 molecules. A major problem in rational drug design is that compounds suggested by the software are often hard or impossible to synthesize. Enamine3 is very proud of the synthesis skills and publication record of its synthetic chemists; chemical knowledge is critical to the “make-on-demand” concept. The company has 240 million “MADE” building blocks, which can be made on a gram scale, and billions of Readily accessible (REAL)4 screening compounds “made” by running 195 validated synthetic procedures on 130,000 qualified building blocks. Validation is a rigorous process, and reagents are scored on Enamine’s experience of how well they work in robust reactions. For example, if a reductive amination works well in 81% of 293 cases where a certain aldehyde is used, then that aldehyde gets a high score. If only 4% of 54 reductive aminations succeed with a certain aldehyde, then that aldehyde will be excluded from construction of the REAL database.
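
The reagent scoring idea described above can be illustrated with a minimal sketch. It assumes that a building block is judged purely on its observed success rate in a given reaction type; the two cutoffs, the class and function names, and the reagent identifiers are hypothetical, since the talk gave only the two examples (81% of 293 and 4% of 54) and not Enamine's actual thresholds or scoring formula.

```python
from dataclasses import dataclass

# Hypothetical cutoffs chosen for illustration only; the report does not
# state the thresholds Enamine actually applies.
HIGH_SCORE_FROM = 0.75
EXCLUDE_BELOW = 0.10


@dataclass(frozen=True)
class ReagentRecord:
    """Outcome history of one building block in one reaction type."""
    reagent_id: str   # illustrative identifier, not a real catalog code
    reaction: str
    attempts: int
    successes: int

    @property
    def success_rate(self) -> float:
        # Guard against division by zero for reagents with no history.
        return self.successes / self.attempts if self.attempts else 0.0


def classify(record: ReagentRecord) -> str:
    """Return 'high', 'medium', or 'exclude' from the observed success rate."""
    rate = record.success_rate
    if rate < EXCLUDE_BELOW:
        return "exclude"
    if rate >= HIGH_SCORE_FROM:
        return "high"
    return "medium"


# Success counts are back-calculated from the percentages quoted in the text
# (81% of 293 is about 237; 4% of 54 is about 2), so they are approximate.
good_aldehyde = ReagentRecord("aldehyde_A", "reductive amination", 293, 237)
poor_aldehyde = ReagentRecord("aldehyde_B", "reductive amination", 54, 2)

for rec in (good_aldehyde, poor_aldehyde):
    print(rec.reagent_id, f"{rec.success_rate:.0%}", classify(rec))
```

With these assumed cutoffs, the first aldehyde would be rated highly and the second would be left out when the database is constructed. A production scheme would presumably also weigh the number of attempts, which this sketch ignores.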
The REAL compounds can be made by parallel synthesis on a mg scale using one-pot chemistry in 1-3 steps. Subsets of the REAL database have been made, for example, a subset of 1.36 billion druglike compounds that can be made by Enamine within 3-4 weeks with a success rate of about 80%. Price and delivery time for these compounds can be guaranteed. The REAL space contains 15.5 billion compounds. The actual compounds in the space are not enumerated.

The REAL database of enumerated structures can be searched online in the Chemspace5 catalog (using NextMove’s Arthor software),6 through the Chemspace API, or using KNIME. ChemAxon’s MadFast7 is used to search the REAL database in EnamineStore. REAL use cases have been reported.8-10 The REAL space is too big to enumerate but it can be similarity-searched1 using BioSolveIT’s Feature Trees (FTrees)11 pharmacophore-style similarity search software, facilitating virtual high throughput screening. (FTrees is described in more detail below.) Scaffold-hopping is one of the strengths of FTrees. A recent use case has been reported.12 BioSolveIT’s infiniSee13 software is used to navigate the space. The size of chemical space is tremendous, and Enamine has explored only a small part of it, but has delivered a proven success rate in synthesis.

Searching for novel chemical hit matter in large chemical spaces
Daniel Kuhn, Merck Healthcare KGaA, Darmstadt, Germany

Optimizing small molecule drugs is a multiparameter problem.14 The design, development, and synthesis of drugs are skills that medicinal and computational chemists have learned and honed over years of training and practice. Nowadays, pharma needs to design better drugs faster, so compound design and structure-activity relationship (SAR) analysis are moving from an art toward a process. Virtual screening in large chemical spaces is increasingly used to identify novel starting points for hit identification.
Searching such a huge space to identify which compounds to make next is a big challenge. Merck AcceSSible InVentory (MASSIV) is Merck’s in-house chemical space of synthetically accessible compounds. It is based on public chemical reactions and on in-house reactions in Merck’s electronic lab notebook (ELN), called ELAB, which acts as Merck’s internal knowledge sharing platform. ELAB reactions are classified by InfoChem’s CLASSIFY algorithm.15 The 10^6 building blocks for MASSIV are from eMolecules, Sigma-Aldrich, and Merck’s own collection. In silico synthesis is carried out using validated reaction spaces resulting from the merger of public and in-house reactions. MiniMASSIV is a subset made by modifying one compound at one site in one reaction. The MASSIV virtual space of 10^20 molecules is similarity-searched using FTrees.11 Postfiltering is important in hit selection.

Application of MASSIV virtual space searches in projects is combined with medicinal chemistry initiatives. Virtual screening, deep learning, docking, and binding activity prediction using free energy perturbation, FEP+,16 have been used in 14 Merck projects. Proof of concept for synthesis was achieved in six cases; synthesis is underway in one case; actives have been found in six other cases.

Merck has learned a number of lessons from applying smart screening rather than hard screening. Ultralarge chemical spaces can provide interesting chemistry as starting points for hit identification. If you have dedicated parallel chemistry resources, you can quickly follow up on the ideas. Outsourcing to CROs (as Merck does) can be slow and expensive. Searching in dedicated make-on-demand chemical spaces such as REAL Space is fast and cost-efficient.

Kuhn presented a proof-of-principle for in silico optimization of a fragment to a hit. A scaffold searched in REAL Space gave 903 ideas. These were reduced to 750 by 3D ROCS17 (shape similarity for virtual screening).
Docking, together with molecular mechanics with the generalized Born model and solvent accessibility method (MM/GBSA) to estimate free energies, reduced the 750 to 400. A machine learning model for microsomal clearance reduced the 400 to 250 ideas. Finally, FEP gave eight ideas, for which the compounds were ordered from Enamine. They took four weeks to arrive, at a cost of less than 100 euros a compound. Five out of eight have IC50 < 100 µM. Merck has reported broad application of FEP+ across multiple targets and series. Screening of large custom-built libraries is an effective way to provide added value in the projects.18

Boehringer Ingelheim Comprehensive Library of Accessible Innovative Molecules (BICLAIM)
Uta Lessel, Boehringer Ingelheim Pharma GmbH & Co. KG (BI), Biberach, Germany

The traditional approach to de novo drug design consists of fragmenting compounds and then joining up the fragments in artificial compound transformations. This results in huge numbers of compounds, many of which may be impossible to synthesize easily. BI’s goal was to combine de novo design with synthetic accessibility. BICLAIM represents trillions of virtual compounds. It is impossible to enumerate all of them, so BioSolveIT software is used: CoLibri19 transforms synthetic knowledge into chemical spaces, and FTrees11 is used to search the fragment space for compounds similar to a known active compound.

Chemical fragment spaces consist of molecular fragments and corresponding connection rules. The CoLibri reaction synthesizer takes reaction definitions as input and generates an individual fragment space for each of them. The CoLibri fragment space merger takes the output of the reaction synthesizer (multiple individual fragment spaces) and merges them. The result of a search is a list of components that are similar to a query, but in addition, the names of