Enhancing Reaction-Based De Novo Design Using Machine Learning
A thesis submitted to the University of Sheffield in fulfilment of the requirements for the degree of Doctor of Philosophy

by Gian Marco Ghiandoni

This work was sponsored by The University of Sheffield Information School - Faculty of Social Sciences

December 2019

Acknowledgements

Reaching the end of a doctoral study is something you do not do on your own. There are several people who deserve to be acknowledged for the achievement of such a personal milestone.

First, I want to thank my supervisors, Prof Val Gillet, Prof Beining Chen, and Dr Mike Bodkin: Val, in primis, for her precious experience and feedback, and most importantly, for trusting in me during all the time I have spent at Sheffield as a student; Beining, for her support and presence during this journey; Mike, for his continuous encouragement, for the stimulating discussions, and for giving me the opportunity to work in close contact with many skilled scientists. These people gave me the chance to challenge myself in this adventure.

I also thank the "Reaction Vector Cowboys", my colleagues James Webster and Dr James Wallace, for the close collaboration we have established and managed to preserve during all these years. Most of the concepts that have been formulated and developed in this work are the fruit of hours of discussions with these two exceptional scientists. Next to them, I want to thank Dr Dimitar Hristozov, for his great contribution to my research and the reaction vector project; Dr Antonio de la Vega de León and Dr Alessandro Checco, for their help, especially at the beginning of my studies; Dr Matthew Seddon, Dr Christina Founti, Dr Lucyantie Mazalan, Dr Philip Reeve, Jessica Stacey, Arshnous Marandi, and all the people I have worked with at the University of Sheffield.

I also want to thank the people who supported me from far away: my mother, grandmother, sisters, and my two uncles, without whom this could not have been possible.
My friends Mario, Marco, Riccardo, Chris, and Tommy, for their steady support; Margherita, for encouraging me to pursue my dreams; Luca, for helping and protecting my family.

Special acknowledgements are due to Prof Peter Willett, for supporting my research with his vast knowledge; Prof Jon Sayers, for allowing me to work with him and the Sheffield Medical School; Dr Richard Mead, for providing me with an exceptional case study to work on and for his interest in the development of our techniques; Dr Stuart Flanagan, for carrying out the syntheses of the compounds designed in this work with great commitment; Dr Daniel Lowe, for providing me with the data for my experiments; Marion Leclerc, for her artistic contribution to this work.

Finally, I would like to thank Evotec U.K. and the Engineering and Physical Sciences Research Council (EPSRC) for their financial support and assistance.

Abstract

De novo design is a branch of chemoinformatics concerned with the rational design of molecular structures with desired properties. When applied to drug design, it specifically aims at achieving suitable pharmacological and safety profiles. Scoring, construction, and search methods are the main components that de novo design programs exploit to explore chemical space and to encourage the cost-effective design of new chemical entities. In particular, construction methods provide strategies for compound generation that address issues such as drug-likeness and synthetic accessibility. Reaction-based de novo design consists of combining building blocks according to transformation rules extracted from collections of known reactions, with the intention of restricting the enumerated chemical space to a manageable number of synthetically accessible structures.
The reaction vector is an example of a representation that encodes the topological changes occurring in reactions, and it has been integrated within a structure generation algorithm to increase the chances of generating molecules that are synthesisable. The general aim of this study was to enhance reaction-based de novo design by developing machine learning approaches that exploit publicly available data on reactions. A series of algorithms for reaction standardisation, fingerprinting, and reaction vector database validation were introduced and applied to generate new data on which the entirety of this work relies. First, these collections were applied to the validation of a new ligand-based design tool. The tool was then used in a case study to design compounds, which were eventually synthesised using procedures very similar to those suggested by the structure generator. A reaction classification model and a novel hierarchical labelling system were then developed to introduce the possibility of applying transformations by class. The model was augmented with an algorithm for confidence estimation and was used to classify two datasets, one from industry and one from the literature. Results from the classification suggest that the model can be used effectively to gain insight into the nature of reaction collections. Classified reactions were further processed to build a reaction class recommendation model capable of suggesting appropriate reaction classes to apply to molecules according to their fingerprints. The model was validated and then integrated within the reaction vector-based design framework, whose performance was assessed against the baseline algorithm. Results from the de novo design experiments indicate that the use of the recommendation model leads to higher synthetic accessibility and a more efficient management of computational resources.
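
As a rough illustration of the idea of encoding a reaction as a set of topological changes, the following minimal Python sketch treats a molecule as a bag of bond descriptors and a reaction as the difference between product and reactant counts. The descriptor strings (such as "C-OH"), the function names, and the toy esterification are assumptions made for this illustration only; they are not the encoding or the structure generation algorithm used in the thesis.

```python
from collections import Counter


def reaction_vector(reactant_bonds, product_bonds):
    """Toy difference vector: product counts minus reactant counts.

    Descriptors whose count does not change are dropped.
    """
    diff = Counter(product_bonds)
    diff.subtract(Counter(reactant_bonds))
    return {key: delta for key, delta in diff.items() if delta != 0}


def apply_vector(molecule_bonds, vector):
    """Add a difference vector to a molecule's descriptor counts.

    Returns None when the molecule lacks a descriptor that the
    vector needs to remove (the transformation is not applicable).
    """
    counts = Counter(molecule_bonds)
    for key, delta in vector.items():
        counts[key] += delta
        if counts[key] < 0:
            return None
    return {key: n for key, n in counts.items() if n > 0}


# Hypothetical known reaction: acid + alcohol -> ester (hydrogens ignored).
known_reactants = ["C-C", "C=O", "C-OH", "C-OH"]
known_product = ["C-C", "C=O", "C-O", "C-O"]
vector = reaction_vector(known_reactants, known_product)

# Apply the same change to a different pair of starting materials.
new_reactants = ["C-C", "C-C", "C=O", "C-OH", "C-C", "C-OH"]
new_product = apply_vector(new_reactants, vector)

# A molecule with a single hydroxyl cannot supply both removed descriptors.
not_applicable = apply_vector(["C-C", "C-OH"], vector)
```

A count-based difference like this only states which descriptors a product must contain. Turning those counts back into a valid molecular structure is a separate reconstruction step that this sketch does not attempt.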
Table of Contents

Acknowledgements
Abstract
Table of Contents
List of Figures
List of Tables
Table of Common Acronyms
Preface

Chapter 1: Chemical Representations
  1.1. Introduction
  1.2. Molecular Representation
    1.2.1. Molecular Graph Theory
    1.2.2. Molecular Search Methods
  1.3. Reaction Representation
    1.3.1. Reaction Mapping
  1.4. Reaction Databases
  1.5. Reaction Search Methods
  1.6. Reaction Classification
    1.6.1. Model-driven Methods
    1.6.2. Data-driven Methods
  1.7. Conclusions

Chapter 2: De novo Molecular Design
  2.1. Introduction
  2.2. The Molecular Design Route
  2.3. De novo Design Components
  2.4. Scoring Components
    2.4.1. Structure-based Scoring
    2.4.2. Ligand-based Scoring
  2.5. Construction Components
    2.5.1. Atom-based Construction
    2.5.2. Fragment-based Construction
  2.6. Search Components
    2.6.1. Stochastic Search
    2.6.2. Deterministic Search
  2.7. Artificial Intelligence in de novo Design