A Consistent Cheminformatics Framework for Automated Virtual Screening

A consistent cheminformatics framework for automated virtual screening Cumulative Dissertation to receive the degree Dr. rer. nat. at the Faculty of Mathematics, Informatics and Natural Sciences University of Hamburg submitted to the Department of Informatics of the University of Hamburg Sascha Urbaczek born in Mannheim Hamburg, August 2014 1. Reviewer: Prof. Dr. Matthias Rarey 2. Reviewer: JProf. Dr. Tobias Schwabe 3. Reviewer: Prof. Dr. Andreas Hildebrandt Date of thesis defense: 23.01.2015 Fürmeine Lena Acknowledgements At this point, I would like to acknowledge the people who supported and encouraged me during my doctoral thesis. First and foremost, I extend my sincere gratitude to my supervisor Prof. Matthias Rarey for entrusting me with this interesting research project and providing guidance and support every step of the way. I also thank the Beiersdorf AG for funding the project and Dr. Inken Groth and Prof. Stefan Heuser in particular for their constant support as well as their great interest in my work. Furthermore, I thank the BiosolveIT GmbH for the productive working relationship during and especially after my time at the Center for Bioinformatics. As for the current and former colleagues from the Center of Bioinformatics (which are far too many to list here); I am grateful to have had the opportunity to work in such a remarkably pleasant and productive environment. I really appreciate everyone's hard work and dedication during these (very successful) last years. In particular I owe a great deal of gratitude to Dr. Tobias Lippert from whom I learned a lot and who has become a very dear friend to me. Last but not least I also thank JProf. Tobias Schwabe and to Prof. Alexander Hildebrandt for reviewing my thesis. Kurzfassung Virtuelles Screening hat sich zu einem integralen Bestandteil der industriellen und akademischen Arzneimittelforschung entwickelt. Es wird eingesetzt, um sehr große Sub- stanzdatenbanken mit der Hilfe von computerbasierten Methoden auf eine überschaubare Zahl vielversprechender Kandidaten zu reduzieren. Um eine möglichst hohe Vorhersage- genauigkeit zu erreichen, wird hierbei versucht, so viele Informationen wie möglich über das Zielprotein und seine bekannten Bindungspartner in die Berechnungen einfließen zu lassen. Die resultierenden Arbeitsprozesse umfassen deshalb häufigmehrere Schritte, in denen die verschiedenen Informationen berücksichtigt werden. Aufgrund der Vielzahl verfügbarerMethodiken und des starken Einflusses jedes einzelnen Schrittes auf das Ergebnis, ist virtuelles Screening eine Aufgabe von enormer Komplexitätund mit vie- len Fallstricken, die fürgewöhnlich nur von Spezialisten durchgeführtwerden kann. Das Ziel der vorliegenden Arbeit war der Aufbau einer verlässliche Basis fürdie Entwicklung von Software, die eine Integration von Medizinalchemikern in die com- putergestützteWikstoffsuche fördert. Das Ergebnis dieser Bemühungen ist ein neues chemieinformatisches Softwareframework (NAOMI) fürvirtuelles Screening, welches speziell auf die damit verbundenen Anforderungen angepasst wurde. Erstens erlaubt NAOMI Medizinalchemikern, basierend auf innovativen und intuitiv verständlichen Konzepten, ihre Erfahrung und ihr Spezialwissen an den Stellen der Berechnungen einzubringen, die fürden Erfolg ihrer Projekte maßgeblich sind. Zweitens umfasst NAOMI zahlreiche neue chemieinformatische Methoden, die auf einem konsistenten internen chemischen Modell aufbauen und eine effiziente Ausführungder einzelnen Schritte des virtuellen Screenings erlauben. Die so erreichte Effizienz ist jedoch nicht nur eine notwendige Voraussetzung fürHochdurchsatz-Screening, sondern spielt auch eine zentrale Rolle in interaktiven Anwendungen, die in Kombination mit einer intu- itiven Benutzerschnittstelle eine Schlüsselrollein der Integration von Medizinalchemik- ern in die computergestützteWirkstoffsuche spielen. Drittens wurde viel Wert darauf gelegt sicherzustellen, dass die Ergebnisse der verschiedenen Rechenschritte chemisch sinnvoll sind und dass eine hoher Grad an Konsistenz zwischen den verschiedenen Kom- ponenten des Prozesses sichergestellt ist. Dies ist von entscheidender Bedeutung, da viele Schritte automatisiert werden müssen,um zu erreichen, dass Medizinalchemiker sich auf die fürsie relevanten Teilaspekte konzentrieren können.Wie in den Publikatio- nen dieser kumulativen Dissertation gezeigt wird, sind zahlreiche Methoden in NAOMI selbst relevante Beiträgein ihren jeweiligen Anwendungsgebieten und ihre Kombination erlaubt den Aufbau effizienter, komplett automatisierter und hoch-adaptiver Screening- Prozesse. Abstract Virtual screening has long since become an integral part of the drug discovery process in both industry and academia. Its main purpose is to reduce huge compound databases to a manageable number of promising drug candidates with the help of computational methods. In order to provide reliable predictions, screening campaigns always aim to incorporate as much knowledge as possible about the target protein and its known binding partners. For that reason, the resulting workflows generally comprise multiple consecutive stages in which the different types of information are processed. The multitude of available methodologies and the strong dependency of the results on each individual step make virtual screening a complex task with numerous pitfalls which is usually only performed by specialized computational chemists. The aim of the presented work was to provide a reliable basis for the development of software applications supporting the inclusion of medicinal chemists in computer-aided drug design activities. The result of this effort is a new cheminformatics framework (NAOMI) for virtual screening which has been specifically designed to meet the demanding requirements imposed by this scenario. First, based on both innovative and intuitively understandable concepts, NAOMI enables medicinal chemists to contribute their expert knowledge and experience at those points of the calculations that are crucial for the success of their current projects. Second, building on the consistent internal chemical model NAOMI comprises numerous novel cheminformatics methods enabling an efficient execution of each individual step of the virtual screening workflow. Apart from being a crucial factor in high-throughput calculations, computational speed is also of the essence in interactive applications which, in combination with intuitive user interfaces, are a key factor in involving medicinal chemists in computational endeavors. Third, great care was taken to ensure that the results of each stage are chemically reasonable and that a high degree of consistency is maintained between the different components of the pipeline. This is important as many steps of the process have to be automated in order to allow medicinal chemists to focus on those aspects relevant for their projects. As is shown in the publications presented in this work, many individual methods included in the NAOMI framework are in themselves relevant contributions to their specific field of application and their combination results in efficient, completely automated, and highly adaptable screening workflows. Contents 1 Introduction 1 1.1 Protein-Ligand Interactions . .3 1.2 Rational Drug Design . .5 1.3 Computer-Aided Drug Design . .7 1.4 Virtual Screening . .9 1.5 Cheminformatics . 11 1.6 Motivation . 12 2 Screening Library 15 2.1 Representation of Molecules . 16 2.2 Representation and Perception of Rings . 20 2.3 Interpretation of Molecules from Chemical File Formats . 25 2.4 Storage of Molecules in Databases . 29 2.5 Selecting Sets of Molecules . 33 2.6 Generation and Selection of Protomers . 37 3 Protein Structure 43 3.1 Representation of Protein Structures . 44 3.2 Structural Data of Protein-Ligand Complexes . 46 3.3 Protomers in Protein-Ligand Complexes . 50 4 Virtual Screening 55 4.1 Representation of Molecular Interactions . 56 4.2 Molecular Docking . 60 4.3 Structure-Based Pharmacophores . 64 4.4 Protomers in Docking . 68 4.5 Virtual High-Throughput Screening . 70 4.6 Inverse Virtual Screening . 72 XI CONTENTS 4.7 Comparison of Binding Pockets . 73 5 Summary and Outlook 75 Bibliography 79 Bibliography of this Dissertation's Publications 95 Appendices 97 A Publications and conference contributions 97 A.1 Publications in scientific journals . 97 A.2 Conferences . 102 B Additional Data 103 B.1 Valence States . 103 B.2 Corresponding Pairs of Valence States . 104 B.3 Substructure Patterns . 106 C Software Architecture 111 C.1 Software Libraries . 111 C.2 Application Software . 117 C.3 Software Testing . 117 D Software Tools 121 E Journal articles 123 XII 1 Introduction Over the course of the last few decades, computers have gained more and more impor- tance throughout all fields of chemistry. They are the only viable means to store and process the huge amount of experimental data produced by chemical research and to perform demanding theoretical calculations with a sufficient level of accuracy. The ever- growing need for more advanced computer-based approaches and improved application programs eventually lead to the establishment of new disciplines, which exclusively focused on the use and development of computational methods to solve chemical prob- lems. The software produced in these fields was initially designed to be used by trained specialists in a supporting capacity. This has, however, considerably changed over the last ten to fifteen years. Nowadays, there is a computer on the desk of nearly every chemist and scientific software

A Consistent Cheminformatics Framework for Automated Virtual Screening

Wannier90 As a Community Code: New

Minimum Cycle Bases and Their Applications

Accelerating Molecular Docking by Parallelized Heterogeneous Computing – a Case Study of Performance, Quality of Results

Universid Ade De Sã O Pa

Open Babel Documentation Release 2.3.1

JRC QSAR Model Database

Lab05 - Matroid Exercises for Algorithms by Xiaofeng Gao, 2016 Spring Semester

BIOINFORMATICS Doi:10.1093/Bioinformatics/Btt213

MACSIMUS Manual

Short Cycles

A Cycle-Based Formulation for the Distance Geometry Problem

Dominating Cycles in Halin Graphs*