Cheminformatics in Natural Product-Based Drug Discovery

Cumulative dissertation with the aim of achieving the degree Doctor rerum naturalium (Dr. rer. nat.) at the Faculty of Mathematics, Informatics and Natural Sciences, Department of Informatics, Universität Hamburg

Submitted by Ya Chen, born in Zhengzhou, China

Hamburg, 2020

The presented thesis was prepared from October 2016 to August 2020 under the supervision of Dr. Johannes Kirchmair at the Department of Informatics, Universität Hamburg.

1. Reviewer: Dr. Johannes Kirchmair
2. Reviewer: Prof. Dr. Gerhard Wolber

Date of thesis defense: 10.11.2020

Abstract

Natural products (NPs) remain the single most prolific source of inspiration for small-molecule drug discovery. Boosted by the increasing amount of data available on the chemical, biological, pharmacological and structural properties of NPs, computational approaches have become a mainstay in NP research. In silico methods are particularly useful as decision support tools, allowing experimentalists to focus their resources on the most promising directions. However, the current knowledge of the quantity, quality and relevance of the available data, as well as of the scope and limitations of cheminformatics methods in NP-based drug discovery, is limited.

The aims of this PhD thesis are hence to (i) develop a comprehensive understanding of the data that can be utilized for the advancement and application of in silico methods in the context of NP research, (ii) develop a new method able to identify NPs and NP-like compounds in large compound collections, in order to maximize the use of the available chemical data, and (iii) determine the capacity of a three-dimensional shape-based method to predict the macromolecular targets of complex small molecules such as complex NPs.

In the first part of this work, a comprehensive perspective on the scope and limitations of in silico methods in NP-based drug discovery is presented.
This is followed by an exhaustive review of a large number of virtual and physical NP libraries that are relevant to applications in cheminformatics, especially in virtual screening. One result of this work is a comprehensive, carefully curated virtual collection of 250k NPs. By overlaying this database with a large set of readily obtainable small organic compounds we are able, for the first time, to estimate the number of readily obtainable NPs, which is in the range of 25k (10% of the known NPs).

In the next phase of this PhD thesis, we conduct an in-depth analysis of the physicochemical and structural properties of the known NPs, the readily obtainable NPs, and individual NP libraries, and compare them with those of approved drugs. An in silico algorithm for removing sugars and sugar-like moieties from NPs and a rule-based approach for the identification of different NP classes are developed. This study shows that NPs are highly diverse. The majority of readily obtainable NPs are found to populate areas in chemical space that are of direct relevance to drug discovery. For several NP databases, a large number of compounds are identified which cover distinct areas in chemical space.

One important lesson from our survey of compound collections is that NPs are often mixed with NP derivatives and analogs, as well as with synthetic compounds. In fact, substantial numbers of potentially valuable NPs are included in commercial compound collections with no mention of NPs or with no labels that would allow their easy identification. This prompts us to develop a machine learning approach that enables the automated cherry-picking of NPs and NP-like compounds from large compound collections. The method is based on a random forest algorithm that obtains a high classification accuracy on holdout data.
Moreover, we implement a method that visualizes the areas of a molecule that contribute to its classification as either an NP or a synthetic compound. The best-performing models are provided via a free web service.

The final part of this thesis is dedicated to what is currently one of the hottest research topics in cheminformatics: the prediction of the macromolecular targets of small organic compounds. NPs pose a particular challenge to such methods because of the scarcity of available bioactivity data on related compounds and the structural complexity of many NPs. We systematically explore the capacity of a three-dimensional shape-based approach to identify the biomacromolecular targets of structurally complex small molecules (including large and flexible NPs and macrocyclic compounds) based on their similarity to non-complex small molecules (i.e. more conventional, "drug-like" synthetic compounds). This approach obtains good success rates even for compounds that are clearly distinct in their structure from any of the ligands present in the knowledge base. Cases of complete failure are recorded only for a small number of targets. However, complex NPs prove to be challenging even with this robust approach.

Overall, this PhD thesis provides a wealth of new information and in-depth knowledge on the available data and cheminformatics methods relevant to natural product-based drug discovery. The study has resulted in accurate models that allow the automated identification and extraction of NPs and NP-like compounds from compound collections, and in a thoroughly validated, three-dimensional shape-based approach for identifying the targets of complex small molecules, especially complex NPs.

Summary (Zusammenfassung)

Natural products remain the most important source of inspiration for the development of modern drugs.
With the growing availability of experimental data on the chemical, biological, pharmacological and structural properties of natural products, computational methods have established themselves as a key technology in natural product research. These theoretical approaches make it possible to direct limited experimental resources toward the most promising directions. However, current knowledge of the quantity, quality and relevance of the available experimental data, as well as of the scope and limitations of modern cheminformatics methods in natural product-based drug development, is limited.

The aims of this doctoral thesis are therefore (i) to develop a comprehensive understanding of the available experimental data that can be used for the advancement and application of computational methods in the context of natural product research, (ii) to develop a computational method for the automated recognition of natural products and natural product-like compounds in large molecular databases (with the goal of maximizing the use of the available chemical data), and (iii) to investigate the capacity of shape-based methods to predict the target proteins of structurally complex active compounds, including natural products.

In the first part of this work, a comprehensive analysis of the scope and limitations of modern cheminformatics methods in natural product research is presented. Subsequently, the available natural product databases relevant to computer-aided drug development are comprehensively analyzed. A key result of this work is a carefully compiled, extensive virtual collection of structural data on 250,000 natural products. This molecular database is compared with a comprehensive data set of the substances available worldwide.
This makes it possible, for the first time, to estimate the number of natural products that are readily accessible for experimental testing. This number is approximately 25,000 substances (corresponding to 10% of all known natural products).

In the next phase of this doctoral thesis, the physicochemical and structural properties of the known natural products and of the available natural products are compared with those of approved drugs. As part of this study, a computational algorithm for removing sugars and sugar-like fragments from natural products and a rule-based approach for identifying different natural product classes are presented. The work demonstrates the structural diversity of the known natural products. Many natural products resemble drugs in their physicochemical properties, whereas others differ markedly from drugs in these properties and cover other regions of chemical space.

An important finding of this doctoral thesis is that natural products, their derivatives and analogs, and synthetic compounds are often mixed in virtual compound libraries and are not labeled accordingly. For this reason, a machine learning method is developed in this work that can automatically identify natural products and natural product-like substances in large compound databases. The method is based on a random forest algorithm and achieves high classification accuracy. In addition, a method is implemented for visualizing the regions of a molecule that contribute substantially to the classification of a compound as a natural product or as a synthetic compound. The best models are freely available to the public via a web service.

The final part of the thesis is devoted to the computational prediction of the target proteins of small organic compounds,
