The Next Level in Chemical Space Navigation: Going Far Beyond Enumerable Compound Libraries
Total Page:16
File Type:pdf, Size:1020Kb
REVIEWS Drug Discovery Today Volume 24, Number 5 May 2019 Reviews The next level in chemical space INFORMATICS navigation: going far beyond enumerable compound libraries 1 2 Torsten Hoffmann and Marcus Gastreich 1 Taros Chemicals GmbH & Co. KG, Emil-Figge-Str. 76a, 44227 Dortmund, Germany 2 BioSolveIT GmbH, An der Ziegelei 79, 53757 Sankt Augustin, Germany Recent innovations have brought pharmacophore-driven methods for navigating virtual chemical spaces, the size of which can reach into the billions of molecules, to the fingertips of every chemist. There has been a paradigm shift in the underlying computational chemistry that drives chemical space search applications, incorporating intelligent reaction knowledge into their core so that they can readily deliver commercially available molecules as nearest neighbor hits from within giant virtual spaces. These vast resources enable medicinal chemists to execute rapid scaffold-hopping experiments, rapid hit expansion, and structure–activity relationship (SAR) exploitation in largely intellectual property (IP)-free territory and at unparalleled low cost. Introduction Lead optimization projects in drug discovery often reach a dead For a long time, computational chemists have attempted to end and result in failure. These projects not only try to exploit propose new ideas for lead molecules by tapping into novel IP hitherto undiscovered modes of action, but may also examine the spaces using techniques such as hit expansion, SAR exploitation, possibility of larger molecules as acting agents, such as peptides, and scaffold hopping, yet the past 25 years have shown us that macrocycles, biologics, and their conjugates. However, many re- current computational algorithms and methods simply might search programs simply suffer from a lack of fast access to novel not be capable of identifying highly innovative nearest neighbor chemical classes that may display highly potent and selective phar- molecules with much success. However, novel computational macological effects. On the one hand, access to larger molecular approaches now make it possible to identify new molecules that spaces is required to increase the likelihood of finding something gain wide acceptance from medicinal chemists. Past experience ‘interesting’, whereas, on the other, methods for searching these has shown that it is decisive to incorporate synthetic knowledge spaces must be quicker, more efficient, and easy to use. Current 5 into pharmacophore-based similarity searches in huge virtual chemical spaces range in size from a few thousand to 10 molecules 8 chemistry spaces. This has been done using elegant computation- offered by specialized suppliers offer (‘small’), up to 10 (‘large’) for al algorithms that are extremely fast and easy to use, and that can supplier pools, such as Molport, Chemspace, eMolecules, and 8 take pharmacophore-based information into consideration. It others. Sizes considerably beyond 10 molecules (‘giant’) [2–4] has proved equally important to involve medicinal chemists can be considered as not practically tractable with traditional meth- during the generation of results in silico. The best currently ods for various reasons that we discuss further below. available methods can search spaces close to the 4 billion mole- cule mark, and offer guaranteed delivery of successfully synthe- The pressure to be grand: novelty and IP sized, tangible compounds, making rapid biological Beyond the obvious therapeutic focus, novelty is equally important. characterization from such enormous virtual chemistry spaces Only new IP can be patented and, thus, generate profit to finance a realistic possibility [1]. further research in the search of new therapeutics. The solution to the search space size problem is to expand the realms of possibility Corresponding author. Gastreich, M. ([email protected]) using virtual molecules. One of the most prominent examples is the 1359-6446/ã 2019 The Authors. Published by Elsevier Ltd. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/). 1148 www.drugdiscoverytoday.com https://doi.org/10.1016/j.drudis.2019.02.013 Drug Discovery Today Volume 24, Number 5 May 2019 REVIEWS generic database (GDB) approach by Jean-Louis Reymond’s group in Waller et al. [15] examines synthetic accessibility by retrosynthetic Berne, Switzerland. His lab computationally generated all possible considerations and uses this information to train a proposal engine. organic molecules under boundary constraints and applied various The authors trained AI algorithms with millions of reactions from filters to avoid the creation of unwanted chemistry, such as multiple Elsevier’s Reaxys database to predict promising synthetic pathways. annealed small rings and instable element combinations. Particu- The routes that the machine proposes were said to be ‘on par with larly noteworthy are the graphical representations of chemical [previously] reported routes’. Yet, the IP-related concerns remain, spaces in the respective publications (e.g., [5]). and any algorithm that relies on enumeration (i.e., processing Although rule-based generation (e.g., as conducted in [6] for individual molecules; Fig. 1c) will suffer from prohibitively long macrolide scaffolds) is one approach to create an increasing num- runtimes for very large numbers of molecules. ber of molecular ideas, artificial intelligence (AI) methods have One modern approach in the synthetic organic chemistry lab that also recently gained attention for creating more novel IP in virtual avoids enumeration utilizes combinatorial expansion instead in the molecular matter from a starting point [7]. Irrespective of the formofDNA-encoded libraries (DELs).Theunderlyingtechnique for INFORMATICS general approach to increasing the size of libraries by applying library creation can cover very large numbers of molecules [16,17]. in silico techniques, researchers are now ready to create hundreds Here, millions (or more) of combinatorially generated, putative 8 of millions (10 ) virtual compounds ([8,9] and references therein). small molecule binders are tagged with a ‘signature’ combination Reviews However, mere size causes several challenges for the way in which of DNA snippets, where the DNA acts like a barcode and, thus, is an these compound numbers are processed or searched. Data-handling unambiguous identifier. When screening against an immobilized issues already arise during computer processing and in file or data- protein target, those ligands that bind will stay bound and all others base storage; file sizes easily surpass the terabytemark,searches, even can bewashed off.SubsequentPCRamplification andread-outofthe in cloud-based environments, foreseeably take too much time: a barcodes of the binders enables scientists to identify active small 2012 publication by the Reymond group invested 100 000 central molecules. Nevertheless, using computers as proposal engines processing unit (CPU) hours on 360 processors in parallel to enu- remains of interest to save time and resources before investing into merate molecules with up to 17 heavy atoms [10], documenting the real wet lab time. enormous efforts needed with such large compound pools. To give an example, a comparably small, 1-million compound subset of the Combining fragments: a reborn approach? GDB-13 [10] in uncompressed 3D SD format requires almost 1 GB Retrosynthetically motivated reassembly of fragments in silico has hard disk space; Ruddigkeit et al. [11] stated that the GDB-17 mole- emerged as a method to overcome the computational challenges 11 cules (1.66 Â 10 molecules) require 400 GB disk space in zipped described earlier. The idea is as simple as it is striking, especially if 20 8 format. Thus, 10 molecules would require 2 Â 10 GB or 200 000 one sees the analogy between virtual fragments in the computer TB in compressed format; van Hilten et al. supplied more estimated and building blocks in the wet laboratory: taking only 1000 storage sizes in their recent review [12]. In addition, such large fragments that are allowed to combine via two reactions to form numbers of enumerated virtual molecules have considerable effects (‘de novo’) molecules, an impressive number of 1 billion 9 on the time needed to search through the molecular contents of the (1000 Â 1000 Â 1000 or 10 ) virtual possibilities are already creat- space. State-of-the-art research, using computer code that is opti- ed. Adding more fragments and reactions cause an explosion in mizedfor the hardwareandincorporates numerousadvanced speed- the size of the search space. up strategies has shown that the timings for one substructure search One of the first to pursue this idea was Xiao Lewell and co- in a few hundred million virtual compounds can be reduced to a few workers with their popular RECAP approach [18]. After it was seconds [13]. However, considering the linear scaling behavior of published, it was quickly implemented in various computer pro- this type of search, even these optimized approaches can no longer grams. The initial idea was to break down existing molecules by 12 20 be used for the larger space sizes of 10 or 10 compounds [2] that applying retrosynthetic shredding by the computer and then to are emerging now. reassemble novel compounds in a Lego-like fashion. Later on, Experience has shown that the principle of enumeration can also more elaborate approaches that took existing knowledge about cause patentability concerns because of the finite nature of classical synthetic pathways into