
Watson et al. J Cheminform (2019) 11:1
https://doi.org/10.1186/s13321-018-0323-6
Journal of Cheminformatics

SOFTWARE Open Access

A retrosynthetic analysis algorithm implementation

Ian A. Watson, Jibo Wang and Christos A. Nicolaou*

*Correspondence: [email protected]
Discovery Chemistry, Lilly Research Laboratories, Eli Lilly and Company, Indianapolis, IN 46285, USA

© The Author(s) 2019. This article is distributed under the terms of the Creative Commons Attribution 4.0 International License (http://creativecommons.org/licenses/by/4.0/), which permits unrestricted use, distribution, and reproduction in any medium, provided you give appropriate credit to the original author(s) and the source, provide a link to the Creative Commons license, and indicate if changes were made. The Creative Commons Public Domain Dedication waiver (http://creativecommons.org/publicdomain/zero/1.0/) applies to the data made available in this article, unless otherwise stated.

Abstract

The need for synthetic route design arises frequently in discovery-oriented chemistry organizations. While finding solutions to this problem has traditionally been the domain of human experts, several computational approaches, aided by algorithmic advances and the availability of large reaction collections, have recently been reported. Herein we present our own implementation of a retrosynthetic analysis method and demonstrate its capabilities in an attempt to identify synthetic routes for a collection of approved drugs. Our results indicate that the method, leveraging reaction transformation rules learned from a large patent reaction dataset, can identify multiple theoretically feasible synthetic routes and, thus, support research chemists' everyday efforts.

Keywords: Retrosynthetic analysis, Chemical synthesis, Synthetic route design, Reaction informatics

Introduction

Research needs for chemical synthesis predictability, synthetic route planning and reaction optimization have motivated the development of several computational tools in recent years [3, 15, 19]. Traditionally, these tools implement methods based on precedent reaction look-up or retrosynthetic analysis solutions [8]. The former relies on the presence of a collection of reactions and attempts to match the query structure to a known reaction product (Reaxys; SciFinder). When a match is identified the original reaction is retrieved and, often, a search in available structural databases is performed to define availability of reactants. Retrosynthetic analysis (RA) approaches, first introduced through the pioneering work of Corey [5], use chemical reaction rules to deconstruct query structures into reactants, followed by a search for availability in structure collections [18]. The preparation of the deconstruction rules may either be assigned to human experts or take place via computational, data-driven processes [14]. Both reaction look-up and retrosynthetic analysis approaches may iterate by using each reactant as a new query structure.

In a typical RA scenario, given a target chemical structure, the process initiates a recursive decomposition loop using available synthetic knowledge. In each step, the input chemical structure is broken into fragments which are complemented with functional groups necessary for the reaction to take place. The virtual reactants, referred to as synthons in the remainder of this manuscript, can then be matched against a database of available building blocks. The process ends when synthons match available building blocks, when the structure under investigation cannot be broken down further or when a predefined search depth has been reached.

Key components of the above process include a database of synthetic reactions to serve as the source of synthetic knowledge, and a collection of available building blocks which serves as a look-up for synthons. The proliferation of chemical structure databases renders the latter component easy to address. For example, researchers can nowadays freely access building block collections advertised as "readily available" from numerous chemical structure vendors [7]. In addition, researchers associated with the pharmaceutical industry or sizeable academic institutions often have access to inhouse chemical sample management databases of significant size. Access to a comprehensive, free, reliable source of chemical reactions has been more challenging since such data has typically been described in textual form in laboratory notebooks, journal publications and patents. Collections prepared by large publishing houses and professional organizations are only provided on a fee-for-service basis, while laboratory notebooks are often difficult to search since their primary design purpose has been recording experiments while they are executed, not comprehensively searching a posteriori. Of immediate practical use is work reported by Lowe et al. [9] to extract reactions from patent applications to the United States Patent and Trademark Office (USPTO) and make these reactions available in the public domain [10].

In this paper we describe our efforts to develop a data-driven retrosynthetic analysis engine aiming to provide synthetic routes for input chemical structures. We thoroughly discuss all implementation aspects, including design decisions and algorithmic details, and present results from the training of this engine and the application to a collection of approved small molecule drugs. This tool, originally developed to serve inhouse needs and currently in operation, is provided to the community in an effort to facilitate research in synthetic route design and reaction informatics in general. Emphasis has been placed on developing a tool agnostic to reaction and reactant data specifics so that interested parties can train and apply the method to any reaction source and building block collection appropriately formatted.

Conceptual design

The Retrosynthetic Analysis (RTSA) method is a computational, data-driven approach designed to identify potential synthetic routes for a structure of interest, referred to in this manuscript as the hypothesis structure. The method uses as input a dataset of reaction examples to prepare retrosynthetic analysis rules used to decompose hypothesis structures into virtual building blocks. Similar to all data-driven methods, the performance of RTSA depends heavily on the quality of the supplied input data. For the purposes of the work described in this paper we use the publicly available USPTO reaction set, which can be readily accessed and used [10].

The operation of RTSA has two distinct phases. The training phase, RTSA-Train, consumes all available reaction data and produces a collection of chemical structure decomposition rules and information required for synthetic route design. RTSA-Design receives the hypothesis structure(s) and designs possible synthetic routes using the collection of decomposition rules obtained during training. In the following sections the two phases of RTSA are described in detail.

RTSA-train

The main aim of the training process is to produce Reverse Reaction Templates (RRT) for use during retrosynthetic analysis. As the name suggests, RRTs are synthetic transformation rules which work conversely to decompose, i.e. transform, products into building blocks. Each RRT is derived from a class of highly similar reactions from which annotation data pertinent to the template, e.g. number of reactions, can be calculated. The process takes as input a set of reactions with atom mapping in place and, sequentially, reverses and standardizes the reactions, followed by the extraction of an extended reaction core. During reversal, the two sides of the reaction, left-hand (reactants) and right-hand (products), are exchanged. Agents, i.e. structures above and below the reaction line (e.g. catalysts, solvents), remain in their original place (Fig. 1).

Fig. 1 (a) A reaction example from US Patent 04703036 and (b) the reversed version of the same reaction. The reaction core is highlighted in red on the product side

Standardization of the reaction includes, among others, removal of duplicate fragments, removal of fragments that do not participate in the reaction and handling of reactions where there has been an obvious failure of the atom mapping process. The extraction of the extended reaction core requires that the main reaction core is first defined by identifying atoms whose properties (e.g. atomic symbol, connectivity, ring membership, etc.) change during the reaction. If any of these atomic properties changes from one side of a reaction to the other, that atom is considered a changing atom, part of the reaction core [8]. A reaction example illustrating the changing atoms forming the reaction core is shown in Fig. 2.

In [...] to accommodate varying user preferences the software has been implemented to allow flexible atom type definitions. A full listing of the properties implemented can be found in Additional file 1: (SI 1).

A reaction is however not completely defined by just the changing atoms, or reaction core. Across a given reaction type, there will be a great diversity of behaviors, such as varying yields, different catalysts needed, different solvents, temperatures, etc. These differences can be attributed to the larger context of the reaction core as embedded within the molecule. To capture more of this context, we apply a recursive Morgan-like expansion [11] around the reac-
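
The reversal step and the identification of changing atoms described in the RTSA-train section can be illustrated with a minimal Python sketch. It is not the RTSA implementation: the `Reaction` class, the `reverse` and `changing_atoms` functions, and the three-field atom property tuple (atomic symbol, heavy-atom connectivity, ring membership) are hypothetical names and simplifications chosen for illustration, and the atom properties are written out by hand rather than computed by a cheminformatics toolkit. The example reaction (methyl isocyanate plus methylamine giving N,N'-dimethylurea, with dichloromethane as an assumed agent) is likewise illustrative and not taken from the paper.

```python
from dataclasses import dataclass
from typing import Dict, List, Set, Tuple

# Hypothetical per-atom properties: (atomic symbol, heavy-atom connectivity, in ring).
AtomProps = Tuple[str, int, bool]


@dataclass
class Reaction:
    """Toy atom-mapped reaction; an assumption, not the RTSA data model."""
    reactants: List[str]              # mapped SMILES on the left-hand side
    agents: List[str]                 # catalysts, solvents, etc.
    products: List[str]               # mapped SMILES on the right-hand side
    lhs_atoms: Dict[int, AtomProps]   # atom map number -> properties (left)
    rhs_atoms: Dict[int, AtomProps]   # atom map number -> properties (right)


def reverse(rxn: Reaction) -> Reaction:
    """Exchange the two sides of the reaction; agents stay in place."""
    return Reaction(
        reactants=rxn.products,
        agents=rxn.agents,
        products=rxn.reactants,
        lhs_atoms=rxn.rhs_atoms,
        rhs_atoms=rxn.lhs_atoms,
    )


def changing_atoms(rxn: Reaction) -> Set[int]:
    """Map numbers found on both sides whose property tuples differ."""
    shared = rxn.lhs_atoms.keys() & rxn.rhs_atoms.keys()
    return {m for m in shared if rxn.lhs_atoms[m] != rxn.rhs_atoms[m]}


# Illustrative reaction: methyl isocyanate + methylamine -> N,N'-dimethylurea.
rxn = Reaction(
    reactants=["[CH3:1][N:2]=[C:3]=[O:4]", "[NH2:5][CH3:6]"],
    agents=["ClCCl"],
    products=["[CH3:1][NH:2][C:3](=[O:4])[NH:5][CH3:6]"],
    lhs_atoms={1: ("C", 1, False), 2: ("N", 2, False), 3: ("C", 2, False),
               4: ("O", 1, False), 5: ("N", 1, False), 6: ("C", 1, False)},
    rhs_atoms={1: ("C", 1, False), 2: ("N", 2, False), 3: ("C", 3, False),
               4: ("O", 1, False), 5: ("N", 2, False), 6: ("C", 1, False)},
)

reversed_rxn = reverse(rxn)
print(reversed_rxn.reactants)         # the urea is now on the left-hand side
print(sorted(changing_atoms(rxn)))    # [3, 5]
```

In this toy case only the isocyanate carbon (map 3) and the amine nitrogen (map 5) change connectivity, so they form the reaction core, and reversing the reaction leaves that set unchanged. With such a small property set, an atom whose neighbor is swapped without a change in neighbor count would not be flagged, which is one reason a richer, configurable set of atom properties is useful.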
