Diplomarbeit/Diploma Thesis
DIPLOMARBEIT / DIPLOMA THESIS

Titel der Diplomarbeit / Title of the Diploma Thesis: “An evaluation of the accuracy of drug-related InChI & InChIKey on ChemSpider, DrugBank, PharmXplorer, PubChem and Wikipedia”

verfasst von / submitted by: Joachim Tscherny
angestrebter akademischer Grad / in partial fulfilment of the requirements for the degree of: Magister der Pharmazie (Mag.pharm.)
Wien, 2018 / Vienna, 2018
Studienkennzahl lt. Studienblatt / degree programme code as it appears on the student record sheet: A 449
Studienrichtung lt. Studienblatt / degree programme as it appears on the student record sheet: Diplomstudium Pharmazie
Betreut von / Supervisor: Univ.-Prof. Mag. Dr. Gerhard Ecker

Acknowledgments

Foremost, I would like to thank my supervisor Gerhard Ecker for the opportunity to realize this project. Through this thesis, I was able to expand my knowledge extensively. Furthermore, I would like to thank Daniela Digles, who supported me especially at the beginning of my work with the KNIME Analytics Platform. I would also like to express my gratitude to Norbert Haider, who provided the data from PharmXplorer. I would like to express my appreciation to the developers of the KNIME Analytics Platform; I could not have carried out this type of computational work without access to this software. I would like to take this opportunity to thank my family, especially Mom, Dad and my sister Katharina, for the continuous and unconditional support they have given me throughout my studies. And finally, I extend my personal gratitude to Martina for all the love, patience, and guidance she has given me during the last years.

“Iucundi acti labores” (Marcus Tullius Cicero, Brutus 70)

Abstract

Freely available online resources such as ChemSpider, DrugBank, PubChem, and Wikipedia are widely used for obtaining information on drugs. For pharmacy students of the University of Vienna, PharmXplorer is a commonly used source of information.
This project investigates whether drug-related InChI & InChIKey are consistent across the databases ChemSpider, DrugBank, PubChem, and Wikipedia. In addition, a gold-standard dataset was created from the results of the consistency test and used to validate the databases ChemSpider, DrugBank, PubChem, PharmXplorer, and Wikipedia. The workflow tool KNIME Analytics Platform was used to obtain InChI & InChIKey for all drugs approved in Austria from ChemSpider, DrugBank, PubChem, and Wikipedia. The consistency test showed a total consistency of 79.34%. The database validation revealed that PubChem performed best with a correctness of 96.59%, followed by DrugBank (96.07%), ChemSpider (93.88%), Wikipedia (92.83%) and PharmXplorer (83.94%). All in all, when international nonproprietary names are used to query InChI & InChIKey automatically in four different databases, at least two different InChIs & InChIKeys are returned in about 20% of the cases.

Zusammenfassung

Freely available online platforms such as ChemSpider, DrugBank, PubChem and Wikipedia are frequently used to obtain information about drugs. For pharmacy students at the University of Vienna, PharmXplorer is a frequently used source of information. This project examines whether the InChIs and InChIKeys belonging to the drugs are consistent across the databases ChemSpider, DrugBank, PubChem and Wikipedia. Based on the results of the consistency test, a gold-standard dataset was created and used to validate the databases ChemSpider, DrugBank, PubChem, PharmXplorer and Wikipedia. The workflow tool KNIME Analytics Platform was used to obtain the InChIs and InChIKeys of all drugs approved in Austria from ChemSpider, DrugBank, PubChem and Wikipedia. The consistency test yielded an agreement of 79.34% of the InChIs.
Validating the databases against the gold-standard dataset showed that PubChem performed best with a correctness of 96.59%, followed by DrugBank (96.07%), ChemSpider (93.88%), Wikipedia (92.83%) and PharmXplorer (83.94%). When the international nonproprietary name is used to query the corresponding InChI and InChIKey automatically in four different databases, at least two different InChIs and InChIKeys appear in 20% of the cases.

Table of Contents

Acknowledgments
Abstract
Zusammenfassung
Table of Contents
List of Figures
List of Tables
1 Introduction
  1.1 Motivation of the Thesis
  1.2 Statement of the problem
  1.3 Research Question
  1.4 Aim of the thesis
2 Background Methodology
  2.1 The Internet – source of information
  2.2 Definitions
    2.2.1 ATC-Classification System
    2.2.2 Molecule Representation
    2.2.3 Why InChI & InChIKey
  2.3 Databases
    2.3.1 ChemSpider
      2.3.1.1 Content of a ChemSpider entry
      2.3.1.2 Access ChemSpider webservices
    2.3.2 DrugBank
      2.3.2.1 Content of a DrugBank entry
      2.3.2.2 Access DrugBank Data
    2.3.3 PharmXplorer
      2.3.3.1 PharmXplorer information platform
    2.3.4 PubChem
      2.3.4.1 Content of a PubChem Compound entry
      2.3.4.2 Access PubChem: The PubChem API - PUG REST
    2.3.5 Wikipedia
      2.3.5.1 Use of drug information
      2.3.5.2 How Wikipedia works
      2.3.5.3 Is Wikipedia reliable
      2.3.5.4 Content of a drug article
      2.3.5.5 Use of the MediaWiki API
  2.4 Used Tools & Software
    2.4.1 KNIME
    2.4.2 Pywinauto
3 Development of the Methods
  3.1 Preparation for retrieval of Drug Information from ChemSpider, DrugBank, PubChem and Wikipedia
    3.1.1 Retrieval of international nonproprietary names
    3.1.2 Extraction of ATC codes from Austrian Medicinal Product Index
  3.2 Retrieval of Drug Information from ChemSpider, DrugBank, PubChem and Wikipedia
    3.2.1 Retrieval of Drug Information from ChemSpider
    3.2.2 Retrieval of Drug Information from DrugBank
    3.2.3 Retrieval of Drug Information from PharmXplorer
    3.2.4 Retrieval of Drug Information from PubChem
    3.2.5 Retrieval