By Abraham Heifets a Thesis Submitted in Conformity with the Requirements

AUTOMATED SYNTHETIC FEASIBILITY ASSESSMENT: A DATA-DRIVEN DERIVATION OF COMPUTATIONAL TOOLS FOR MEDICINAL CHEMISTRY

by

Abraham Heifets

A thesis submitted in conformity with the requirements for the degree of Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto

© Copyright 2014 by Abraham Heifets

Abstract

Automated Synthetic Feasibility Assessment: A Data-driven Derivation of Computational Tools for Medicinal Chemistry
Abraham Heifets
Doctor of Philosophy
Graduate Department of Computer Science
University of Toronto
2014

The planning of organic syntheses, a critical problem in chemistry, can be directly modeled as resource-constrained branching plans in a discrete, fully-observable state space. Despite this clear relationship, the full artillery of artificial intelligence has not been brought to bear on this problem due to its inherent complexity and multidisciplinary challenges. In this thesis, I describe a mapping between organic synthesis and heuristic search and build a planner that can solve such problems automatically at the undergraduate level. Along the way, I show the need for powerful heuristic search algorithms and build large databases of synthetic information, which I use to derive a qualitatively new kind of heuristic guidance.

Contents

Relation to published work
1 Introduction
2 Prolegomena to any future automated synthesis planning
  2.1 Search
  2.2 AND/OR graph search algorithms
    2.2.1 AO*
    2.2.2 Learning in Depth-First Search (LDFS)
    2.2.3 The question of cycle semantics
    2.2.4 Proof Number Search (PNS)
    2.2.5 PN*
    2.2.6 Proof Disproof Search (PDS)
    2.2.7 Depth-First Proof Number and variants (DFPN, DFPN+, DFPN(r), DFPN-TCA, and DFPN-SNDA)
  2.3 The shapes of chemistry
  2.4 Challenges of organic synthesis
  2.5 Automated organic synthesis planners
    2.5.1 LHASA and its variants (interactive synthesis planners)
    2.5.2 Noninteractive synthesis planners
  2.6 Heuristics
    2.6.1 Complexity-based
    2.6.2 Fragment-based
    2.6.3 Machine learning and retrosynthetic analysis
  2.7 Reaction libraries
  2.8 The critical need for search guidance
3 A declarative description of chemistry
  3.1 Definitions
4 Retrosynthetic search algorithms
  4.1 Introduction
  4.2 A Proof Number Search-based Solver
  4.3 Constructing a Public Chemistry Benchmark
  4.4 Results and Discussion
  4.5 Chapter Conclusion and Future Work
5 Compilation of synthesis and chemical data
  5.1 SCRIPDB
  5.2 Discussion
    5.2.1 Patents as a source of chemical images
    5.2.2 Patents as biomedical literature
    5.2.3 Patents as a reaction database
    5.2.4 Patents as a bioisostere catalog
  5.3 Chapter Conclusion and Future Work
6 Domain-specific heuristics for synthetic feasibility
  6.1 Introduction
    6.1.1 Humans, eh?
    6.1.2 Objective measures of synthetic feasibility
  6.2 Materials and Methods
    6.2.1 Data collection
    6.2.2 Data cleaning
    6.2.3 Data labeling
    6.2.4 Data modeling
  6.3 Results
  6.4 Discussion
  6.5 Chapter Conclusion and Future Work
7 Summary & Conclusion
Appendices
A Glossary
B Which molecules should be built?
  B.1 Introduction
  B.2 System and methods
    B.2.1 Correspondence of bound ligands
    B.2.2 Ligand alignment
    B.2.3 Residue cluster extraction via clique detection
  B.3 Results and discussion
    B.3.1 Heme
    B.3.2 Nicotinamide adenine dinucleotide
  B.4 Chapter Conclusion and Future Work
Bibliography

List of Figures

2.1 Cyclical AND/OR graphs. Semicircles connect paths in a single hyperedge. Double circles indicate goal states. The examples are deliberately simple; equivalent yet more realistic examples may be generated by replacing simple arcs with larger subgraphs.
2.2 If a descendant node can be reached via multiple paths, then Equations 2.1 and 2.2 are no longer correct. An example schematic in which a node has repeated precursors, showing that the (dis)proof counts will be incorrect. In this case, there are only 3 leaf nodes but the root reports a count of 4.
2.3 Graph history interaction problem from Kishimoto and Müller (2004). Assume node D is a loss for the player at the root. Node B is an AND node, marked by a semicircle connecting its out arcs. Nodes C, G, and H are AND nodes as well but are not marked because they have a single outgoing arc.
2.4 Aspirin. Vertices labeled C, H, or O denote carbon, hydrogen, and oxygen atoms, respectively. Edges drawn with single lines denote single bonds and double lines represent double bonds. Typically, bonds to hydrogen are not drawn.
2.5 Esterification reaction of an alcohol active site with an anhydride active site to produce an ester. Atoms in the molecular fragments are numbered to help the reader track bond changes. When atomic labels are omitted, the vertex is presumed to be a carbon atom with sufficient hydrogens to total 4 bonds. Reaction conditions have been omitted for simplicity.
2.6 Synthesis of aspirin (right column) from carbon dioxide, sodium hydroxide, phenol, acetic acid and ketene starting materials (left column) via the precursor molecules, salicylic acid (top middle) and acetic anhydride (bottom middle). The final aspirin-forming step is an application of the esterification reaction depicted in Figure 2.5.
2.7 Palytoxin, a 409-atom molecule synthesized in 1994 (Suh and Kishi, 1994). Bonds depicted as wedges indicate the bond is angled out of the plane of the paper toward the reader. Bonds depicted as dashes indicate the bond is angled away from the reader.
2.8 Example from Todd (2005).
Structure (40) depicts the quinone Diels-Alder reaction, while (41)-(43) show natural products that had been synthesized using the quinone Diels-Alder. LHASA does not apply the quinone DA to these cases.
2.9 8-step synthesis from Takahashi et al. (1990). Molecular complexity proceeds nonmonotonically.
2.10 Figure 19 from Boda et al. (2007) depicting minimum, maximum, and average synthetic accessibility scores by five medicinal chemists. Structures are sorted by average score. Molecules of particularly wide score disagreement are labeled. Most molecules have scores of 4 ± 1 and many have ranges which overlap.
2.11 Figure 7 from Huang et al. (2011) depicting minimum, maximum, and average synthetic accessibility scores by five medicinal chemists. Structures are sorted by average score. Most molecules have scores of 4 ± 1 and many have ranges which overlap.
2.12 Example comparison of synthetic complexity measures, taken from Barone and Chanon (2001). Synthetic progress is nonmonotonic.
2.13 Reaction example from Pirok et al. (2006) depicting a Friedel-Crafts acylation reaction. Additional properties specify the charge necessary at the active site for the reaction to complete. Problematic compounds are excluded with additional patterns.
2.14 Diels-Alder example from Wilcox and Levinson (1986). Lines 1 and 4 show two reactions. Lines 2 and 3 are the MXC and CXC, respectively, for the first reaction. Line 5 shows the maximum common substructure from the two reactions (note the activating electron-withdrawing oxygen on the dienophile).
2.15 Example from Law et al. (2009). Figure (a) depicts a sample reaction, while (b) and (c) show an extracted core and extended core, respectively. Compare to the MXC and CXC from Figure 2.14. Law et al. note that the non-chemically-essential atom 2 is correctly not included in the extended core; in contrast, a bond radius approach such as that of Satoh and Funatsu (1999) would have included it.
4.1 The benchmark target molecules. Images generated directly from the problem definition using OpenBabel (O’Boyle et al., 2011a).
4.2 Computer-generated Atorvastatin synthesis matching the synthesis reported in Brower et al. (1992) and Roth (2002).
